Abstract 



In our daily lives, organizing resources into a set of categories is a common task. 
Organizing resources into categories makes searching through those resources 
easier by limiting the focus to a specific category. Limiting the focus significantly 
reduces the amount of information one must search. Categorization becomes 
more useful as a collection grows, because managing resources becomes 
increasingly difficult if they are not organized appropriately. Large 
collections such as books, movies, and web pages, for instance, are 
usually cataloged in libraries, organized in databases, and classified in 
directories, respectively. However, the sheer size of these collections makes 
organizing them manually a vast and very expensive endeavor. 

Recent research is moving towards developing automated classifiers that re- 
duce the increasing costs and effort of the task. Most of the research in this field 
has focused on self-content, where the publisher is the only author, as a data 
source to discover the aboutness of the resource. Self-content presents the prob- 
lem that it is not always representative enough, and sometimes it is difficult to 
access depending on the type of resource. Little work has been done analyzing 
the appropriateness of and exploring how to harness the annotations provided 
by users on social tagging systems as a data source. Users on these systems save 
resources as bookmarks in a social environment by attaching annotations in the 
form of tags. It has been shown that these tags facilitate retrieval of resources 
not only for the annotators themselves but also for the whole community. Like- 
wise, these tags provide meaningful metadata that refers to the content of the 
resources. 

In this thesis, we deal with the utilization of these user-provided tags in search 
of the most accurate classification of resources as compared to expert-driven cat- 
egorizations. After performing a set of experiments to choose a suitable classifier 
for this kind of task, we explore social annotations looking for a way to best 
use them. For this purpose, we have created three large-scale datasets including 
tagging data for resources from well-known social tagging systems: Delicious, 
LibraryThing, and GoodReads. Those resources are accompanied by categoriza- 
tion data from sound and consolidated expert-driven taxonomies. From these 
resources the appropriateness of social tags for predicting categories can be eval- 
uated. 

Specifically, we first study several ways of representing the massive number 
of social tags by amalgamating the contributions of large communities of users. 
We analyze their suitability for the classification task, upon both broader top level 
categories and narrower deep level categories. Then, we explore the nature, char- 
acteristics, and distributions of tags in folksonomies, in order to determine how 
the settings of each system affect the tagging behavior and the usefulness of tags 
for the classification task. We go deeper into tag distributions by analyzing the 
usefulness of weighting schemes based on inverse frequency values. Finally, us- 
ing state-of-the-art user behavior detection processes, we identify users on social 
tagging systems who better fit the classification task. 

To the best of our knowledge, this is the first research work performing actual 
classification experiments utilizing social tags. By exploring the characteristics 
and nature of these systems and the underlying folksonomies, this thesis sheds 
new light on the way of getting the most out of social tags for the sake of auto- 
mated resource classification tasks. Therefore, we believe that the contributions 
in this work are of utmost interest for future researchers in the field, as well as for 
the scientific community in order to better understand these systems and further 
utilize the knowledge garnered from social tags. 
Introduction 



"Ideals are like stars; you will not succeed in touching them with your hands. But like 
the seafaring man on the desert of waters, you choose them as your guides, and following 
them you will reach your destiny. " 
— Carl Schurz 

1.1 Motivation 

Organizing resources into predefined categories is a natural idea in our daily 
lives. Assigning categories to resources helps facilitate the search for resources 
by reducing the focus to a specific category or categories. Categorization effec- 
tively reduces the amount of resources one has to search. For instance, librarians 
usually organize books into groups of related subjects. Also, movie databases, 
music catalogs, and file systems, among others, tend to be categorized in a way 
that eases access to their resources. Likewise, web directories such as the Yahoo! 
Directory and the Open Directory Project organize web pages into categories. 
Web page classification can substantially enhance search engines by reducing the 
scope of results to the categories of interest to the user (Qi and Davison, 2009). 

The process of manually categorizing resources becomes expensive as the 
collection of resources grows. For instance, the Library of Congress reported 
that the average cost of cataloging each bibliographic record by professionals was 
$94.58 in 2002 1 . For the 291,749 records they cataloged that year, the total cost 
came to more than $27.5 million. Given the high cost of this task, switching 
to automated classifiers seems to be a good alternative that eases the task and 
keeps catalogs updated while reducing manual effort. 

Until now, most automated classifiers have relied on the content of the 
resources, especially for web page classification tasks (Qi and Davison, 2009). 

1 http://www.loc.gov/loc/lcib/0302/collections.html 
Nonetheless, the lack of representative data within many resources makes the 
classification task more complicated. In some cases, it may not be feasible to 
obtain enough data for certain kinds of resources such as books or movies. For 
example, usually the full text of books is not available, and it is not easy to repre- 
sent movies as text or processable data. Without sufficient data, representing the 
content becomes more challenging. 

As a means to solve these issues, social tagging systems provide an easier and 
cheaper way to obtain metadata related to resources. Social tagging systems are 
a means to save, organize, and search resources, by annotating them with tags 
that the user provides. Systems like Delicious 2 , LibraryThing 3 and GoodReads 4 
collect user annotations in the form of tags on their respective collections of re- 
sources. These user-generated tags give rise to meaningful data describing the 
content of the resources (Heymann et al., 2008). User-provided annotations can 
be useful as a data source by providing meaningful information that can help 
infer the categorization of the resources. Our hypothesis is that these large col- 
lections of annotations can enhance the automated resource classification task in 
a noticeable manner. 

By providing tags, users are creating their own categorization system for the 
given resource. The aggregation of users in an active community can create many 
bookmarks, tags, and therefore annotated resources. The more users contribute 
bookmarks and tags to these systems, the more accurately these resources can 
be annotated. 

"Each individual categorization scheme is worth less than a professional 
categorization scheme. But there are many, many more of them" , Joshua 
Schachter, founder of Delicious, at the 2006 FOWA summit in Lon- 
don, England 5 . 

Given that a large number of users are providing their own annotations on 
each resource, our objective is to find an approach that amalgamates their 
contributions in a way that resembles the categorization by professionals. 
In this context, where users are providing large amounts of metadata, our 
challenge lies in making the most of them in order to enhance resource 
categorization tasks. 

"We've entered an era where data is cheap, but making sense of it is not" , 
Danah Boyd, Social Media Researcher at Microsoft Research New 
England, at the WWW2010 conference in Raleigh, North Carolina, 
United States 6 . 

2 http://delicious.com 

3 http://www.librarything.com 

4 http://www.goodreads.com 

5 http://simonwillison.net/2006/Feb/8/summit/ 

1.1.1 Resource Classification 

Resource classification can be defined as the task of labeling and organizing re- 
sources within a set of predefined categories. In this work, we use Support Vector 
Machines (SVM, Joachims (1998)), a state-of-the-art classification approach. This 
type of classification relies on previously categorized or labeled training sets of 
resources. The classifier uses these sets of resources to gather knowledge which, 
in turn, is used to classify new unknown resources. 

Different settings can be used for different resource classification problems. 
The system's learning technique may be supervised or semi-supervised. Supervised 
learning requires that all training resources are previously categorized, whereas 
semi-supervised learning permits unlabeled resources to be taken into account 
during the learning phase. Classification may be binary, where the taxonomy 
offers only two possible categories, or multiclass, where it offers three or 
more. Binary classification systems are commonly used for filtering (e.g., an 
email application that filters out spam messages), whereas multiclass systems 
are necessary for thematic classification with larger taxonomies (i.e., 
classification by topic or subject). 

For thematic classification on large collections of resources, like web pages 
on the Web, or books in libraries, the taxonomies are usually defined by more 
than two categories, and the subset of previously labeled resources tends to be 
tiny. Accordingly, we believe that the application of both semi-supervised and 
multiclass approaches should be considered and analyzed to perform this kind 
of task. 

In this thesis, we analyze several SVM-based classification approaches to 
assess their suitability for these tasks. These include different approaches 
to solving multiclass problems, as well as supervised and semi-supervised 
algorithms. 

1.1.2 Social Annotations 

Social tagging sites allow users to save and annotate their favorite resources 
(e.g., web pages, movies, books, photos or music), socially sharing them with 
the community. These annotations are usually provided by users in the form of 
tags. Tagging is an open way to assign tags or keywords to resources, in order 
to describe and organize them. It eases the later retrieval of resources, with 
tags serving as metadata that describes them. Usually, there are no predefined 
tags, and therefore users can freely choose the words they want as tags. 

6 http://www.danah.org/papers/talks/2010/WWW2010.html 

"Tagging is mostly user interface - a way for people to recall things, what 
they were thinking about when they saved it. Fairly useful for recall, OK 
for discovery, terrible for distribution (where publishers add as many tags 
as possible to get it in lots of boxes).", Joshua Schachter, founder of De- 
licious, at the 2006 FOWA summit in London, England 7 . 

This tagging process generates a tag structure on a social tagging system, a 
so-called folksonomy, i.e., a user-driven organization of resources. Folksonomy 
is a portmanteau of folk (people) and taxonomy, which in turn derives from the 
Greek taxis (classification) and nomos (management). It is also known as a 
community-based taxonomy, where the classification scheme is non-hierarchical, 
as opposed to a classical taxonomy-based categorization scheme. Thus, a 
folksonomy is related to expert-driven taxonomies insofar as resources are 
labeled and put together into groups. 

These annotations are said to belong to a social environment when any user can 
access and benefit from them. This feature enables searching resources 
by taking advantage of annotations provided by others, which in turn encourages 
the contribution of large communities of users. 

Not all the annotations are shared in the same way, though. The social tag- 
ging site itself may establish some constraints, mainly by setting who is able to 
annotate each resource. In this regard, two kinds of systems can be distinguished 
(Smith, 2008): 

• Simple tagging systems: users can describe their own resources, such as 
photos on Flickr 8 , news on Digg 9 or videos on Youtube 10 , but nobody 
annotates others' resources. Usually, the author of the resource is the one 
who annotates it. This means that no more than one user tags a resource. In a 
simple tagging system, there is a set of users ($U$), who are annotating 
resources ($R$) using tags ($T$). A user $u_i \in U$ annotates their resource 
$r_j \in R$ with a set of tags $T_j = \{t_{j1}, \ldots, t_{jp}\}$, with a 
variable number $p$ of tags. The set of tags assigned to $r_j$ will always be 
limited to $T_j$, since nobody else can annotate it. 

• Collaborative tagging systems: many users annotate the same resource, 
and all of them can tag it with tags in their own vocabulary. The collec- 
tion of tags assigned by a single user creates a smaller folksonomy, also 

7 http://simonwillison.net/2006/Feb/8/summit/ 
8 http://www.flickr.com 
9 http://digg.com 
10 http://www.youtube.com 
known as personomy. As a result, several users tend to post the same resource. 
For instance, CiteULike 11 , LibraryThing and Delicious are based on 
collaborative annotations, where each resource (papers, books and URLs, 
respectively) can be annotated and tagged by all the users who consider it 
interesting. A collaborative tagging system is more complex than a simple 
one: there is a set of users ($U$), who are posting bookmarks ($B$) for 
resources ($R$) annotated by tags ($T$). Each user $u_i \in U$ can post a 
bookmark $b_{ij} \in B$ of a resource $r_j \in R$ with a set of tags 
$T_{ij} = \{t_1, \ldots, t_p\}$, with a variable number $p$ of tags. After $k$ 
users have posted $r_j$, it is described with a weighted set of tags 
$T_j = \{w_{j1} t_{j1}, \ldots, w_{jn} t_{jn}\}$, where 
$w_{j1}, \ldots, w_{jn} \le k$ represent the number of assignments of each 
tag. Accordingly, each bookmark is a triple of a user, a resource, and a set 
of tags: $b_{ij}: u_i \times r_j \times T_{ij}$. Thus, each user saves 
bookmarks of different resources, and a resource has bookmarks posted by 
different users. The result of aggregating tags within bookmarks by a user is 
known as the personomy of the user: 
$T_i = \{w_{i1} t_{i1}, \ldots, w_{im} t_{im}\}$, where $m$ is the number of 
different tags in the user's personomy. 

Figure 1.1 shows an example comparing the behavior of both systems. 




Figure 1.1: Comparison of user annotations on simple and collaborative tagging 
systems. 

In this thesis, we will focus on collaborative tagging systems. Because multiple 
users annotate the same resource, tags are highly likely to coincide across 
users, which makes the aggregated tags of collaborative tagging systems more 
robust than the tags of simple tagging systems. 

In a collaborative tagging system, for instance, a user could tag this work as 
social-tagging, research, and thesis, whereas another user could use the 
tags social-tagging, social-bookmarking, phd, and thesis to annotate 
it. Users' behavior may considerably differ on these systems. Because of this, the 

11 http://www.citeulike.org 
aggregation of their annotations is usually considered as the consensus. For 
instance, the aggregation of the above annotations would be the following: thesis 
(2), social-tagging (2), social-bookmarking (1), phd (1), and research 
(1). In this example the values represent the weighted union of all tags. 
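
The aggregation above can be reproduced with a minimal Python sketch. The user names, the resource identifier, and the function names are hypothetical and serve only as illustration; the sketch also assumes that each user counts at most once per tag on a given resource.

```python
from collections import Counter

# Each bookmark is a (user, resource, tags) triple. The user names and the
# resource identifier are hypothetical placeholders.
bookmarks = [
    ("user_a", "this-thesis", ["social-tagging", "research", "thesis"]),
    ("user_b", "this-thesis",
     ["social-tagging", "social-bookmarking", "phd", "thesis"]),
]


def aggregate_by_resource(bookmarks):
    """Weighted union of tags per resource: tag -> number of users using it."""
    aggregated = {}
    for _user, resource, tags in bookmarks:
        # set() guards against a user repeating a tag within one bookmark.
        aggregated.setdefault(resource, Counter()).update(set(tags))
    return aggregated


def personomy(bookmarks, user):
    """Weighted set of tags used by a single user across their bookmarks."""
    counts = Counter()
    for owner, _resource, tags in bookmarks:
        if owner == user:
            counts.update(set(tags))
    return counts


resource_tags = aggregate_by_resource(bookmarks)
```

Here `resource_tags["this-thesis"]` holds the weighted union listed above, and `personomy(bookmarks, "user_a")` returns the three tags of the first user, each with weight 1.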

In this thesis, we analyze and study the annotations provided by end users 
on social tagging systems. We present different methods to use these annotations 
to classify resources as accurately as possible. Specifically, we focus on the anal- 
ysis of the usefulness of tags on user-driven folksonomies as a means to get an 
organization that resembles the categories on expert-driven taxonomies. In this 
context, we study several representations of social annotations, in search of an 
approach that resembles the classification by experts as much as possible. In 
particular, we focus on getting the most out of social tags, by both looking for 
the best representation, and measuring the impact of the distribution of tags 
across the triple of resources, bookmarks and users. Finally, we also study the 
application of state-of-the-art user behavior analysis approaches to detect 
users who tend to provide tags for categorization purposes. 

1.2 Scope of the Thesis 

In this thesis, we investigate how annotations gathered together on social tag- 
ging systems can be harnessed for resource classification. Specifically, this thesis 
focuses on the study of several resource representation approaches using social 
tags. We perform the evaluation of such representations by measuring their sim- 
ilarity to classifications by experts. In this context, we consider the classification 
provided by experts as a ground truth for the evaluation process. We perform 
the classification experiments by using a state-of-the-art classification method, 
namely Support Vector Machines. We also perform a preliminary study to choose 
the appropriate settings for the classifier. 

As meaningful metadata to enhance the resource classification task, we ex- 
plore social tags provided by users from a statistical and distributional point of 
view, and we do not consider other details such as analyzing their linguistic and 
semantic meanings. For us, each text string representing a tag is treated as a dif- 
ferent token, regardless of its meaning. Thereby, rather than analyzing the mean- 
ing of tags, we focus on analyzing the structure of folksonomies, represented by 
triples of users, bookmarks and resources. 

1.3 Problem Statement and Research Questions 

The main goal of this thesis is to shed new light on the appropriate use of the great 
deal of data gathered on social tagging systems. Given the interest of classifying 
resources, and the lack of representative data in many cases, we aim to analyze 
whether, and how, social tags can enhance a resource classification task. 
At the beginning of this work, we found no prior work addressing this question: 
no special attention had been paid to how to represent resources using social 
tags, and no actual classification experiments had been performed. Thus, we were 
motivated to carry out this research work. To that end, we set forth the following 
problem statement, which summarizes the main focus of this thesis: 

Problem Statement 

How can the annotations provided by users on social tagging systems be exploited 
to yield the most accurate resource classification? 

Regarding the classification algorithm, we rely on Support Vector Machines 
(SVM) as a state-of-the-art classification method. Using this method, several ap- 
proaches have already been proposed to work on binary and multiclass scenarios, 
as well as supervised and semi-supervised ones. Nonetheless, there is little work 
comparing different approaches in the multiclass scenario. We assume that these 
kinds of tasks are usually multiclass, and the number of prior annotated resources 
tends to be tiny as compared to the whole collection of resources. Accordingly, 
the first two research questions we formulate in this thesis are: 

Research Question 1 

What kind of SVM classifier should be used to perform this kind of classification 
task: a native multiclass classifier, or a combination of binary classifiers? 

Research Question 2 

What kind of learning method performs better for this kind of classification task: 
a supervised one, or a semi-supervised one? 

Moreover, regarding social annotations, it has been shown that they provide 
useful metadata for improving resource management. Nevertheless, there is little 
work analyzing the usefulness of social tags for performing classification tasks. 
Preliminary analyses have shown encouraging results, and conclude that these 
annotations may be helpful for classification. However, they did not analyze the 
annotations in more depth, and it is not clear whether the representation they 
used was good enough. 

We believe that several factors should be taken into account when represent- 
ing resources using social tags. In contrast to classical document repositories, 
social annotations rely on a triple of users, resources, and tags, which should be 
analyzed in more depth for the representation task. In this context, apart from 
representing the resources, it is worthwhile considering that not all the tags have 
to be equally representative, and not all the users provide equally good 
annotations. In this thesis, our main goal is to delve deeper into how social 
annotations can be used to the greatest extent in search of an accurate 
classification of the resources. Based on these ideas, we formulate the 
following research questions: 

Research Question 3 

How do the settings of social tagging systems affect users' annotations and the 
resulting folksonomies? 

Research Question 4 

What is the best way of amalgamating users' aggregated annotations on a resource 
in order to get a single representation for a resource classification task? 

Research Question 5 

Despite the usefulness of social tags for these tasks, is it worthwhile considering 
their combination with other data sources like the content of the resource as an 
approach to improve the results even more? 

Research Question 6 

Are social tags also useful and specific enough to classify resources into narrower 
categories as in deeper levels of hierarchical taxonomies? 

Research Question 7 

Can we further consider the distribution of tags across the collection so that we can 
measure the overall representativity of each tag to represent resources? 

Research Question 8 

What is the best approach to weigh the representativity of tags in the collection for 
resource classification? 

Research Question 9 

Can we discriminate different user profiles so that we can find a subset of users 
who provide annotations that better fit a classification scheme? 

Research Question 10 

What are the features that identify a user as a good contributor to the resource 
classification? 

1.4 Research Methodology 

The research methodology we followed throughout this work includes 8 steps: 

1. Review of the literature and understanding of social tagging systems. 

2. Looking for an appropriate SVM classifier to perform the work. 

3. Looking for existing social tagging datasets. Since we did not find any 
that fulfilled our requirements, we created three large-scale social tagging 
datasets instead. 
4. Thinking of and proposing approaches to classifying using social tags. 

5. Evaluating the proposed approaches. 

6. Performing a thorough analysis of the results, in order to understand them 
for drawing conclusions. 

7. Showing and presenting partial results at several national and international 
conferences and workshops, in order to get useful comments and feedback 
from other researchers. 

8. Summarizing the research, contributions, and conclusions drawn through- 
out this work by writing this dissertation. 

Steps 4 through 6 formed an iterative process. 



1.5 Structure of the Thesis 

This thesis consists of 8 chapters. Below we provide a brief overview summariz- 
ing the contents of each of these chapters. 

Chapter 1 
Introduction 

We present the motivation for the study on the use of social annotations for 
resource classification. We formalize the problem, and motivate the need 
for such a study. 

Chapter 2 
Related Work 

We provide a survey of previous works in the field. We summarize the 
advances in related fields, not only on the use of social annotations, but 
also on resource classification. 

Chapter 3 

Support Vector Machines for Large-Scale Classification 

We perform a study on different SVM approaches to the problem of classifying 
large-scale resource collections on multiclass taxonomies. The study 
identifies the best-performing SVM approach, which we use for the rest of the 
classification experiments throughout the work. 

Chapter 4 

Generation of Social Tagging Datasets 

We describe and analyze in detail the social tagging datasets we created. 
We detail in depth the process of creation of such datasets, and we analyze 
the main characteristics of the underlying folksonomies. 
Chapter 5 

Representing the Aggregation of Tags 

We propose and evaluate different representations of resources based on 
social tags for a resource classification task. We study the usefulness of 
social tags as compared to other data sources, and propose the best rep- 
resentation approach to get the most out of them. We also deal with the 
combination of social tags with other data sources to yield a better perfor- 
mance. 

Chapter 6 

Analyzing the Distribution of Tags for Resource Classification 

We deal with the task of considering the representativity of tags within 
a collection of social annotations on a social tagging system for resource 
classification. We study the application of weighting schemes adapted to 
social tagging systems, and analyze their suitability by taking into account 
the settings of each system. 

Chapter 7 

Analyzing the Behavior of Users for Classification 

We explore the effect of user behavior on social tagging systems for the 
resource classification task. Previous works suggest the existence of two 
types of users: Categorizers, who use tags to categorize resources, and 
Describers, who use tags to describe resources. Based on these works, we 
study whether tags by Categorizers are better than tags by Describers for 
the resource classification. 

Chapter 8 

Conclusions and Future Research 

We discuss and summarize the main conclusions and contributions of the 
work. We present the answers to the formulated research questions, and 
the outlook on future directions of the work. 

Additionally, the thesis contains the following appendices at the end, with 
complementary information and summaries in other languages: 

Appendix A 
Additional Results 

We present some additional results, which we did not include in the main 
content of the thesis, but are also worth including to prove and help under- 
stand some conclusions. 

Appendix B 

Key Terms and Definitions 

We list the most relevant terms related to social tagging systems, and pro- 
vide a detailed definition of them. 
Appendix C 
List of Acronyms 

We provide a list of the acronyms used throughout the work, and what they 
stand for. 

Appendix D 

Resumen (Spanish Summary) 

We summarize the contents of this work in Spanish. 

Appendix E 

Laburpena (Basque Summary) 

We summarize the contents of this work in Basque. 

1.6 Writing Conventions 

Next, we detail some conventions we defined while writing this thesis. These 
conventions include formatting of text, and some issues regarding English lan- 
guage. 

1.6.1 Formatting 

In the thesis, we mention names of tags many times, either to show them as 
examples or to clarify some explanations. When those tags appear in the text, we 
use a monospaced typeface to differentiate them easily from the rest of the text. 
For instance: reference. 

In the same manner, we emphasize with italic text those inline appearances 
of math formulas, or terms that for some reason have certain importance in the 
context. 

1.6.2 Language Issues 

This thesis, being focused on social media, deals with users of social tagging 
systems at some points. When we refer to a single user, but no distinction is 
made between genders, we use the pronoun they instead of either he or she. For 
instance: 

When a user saves a bookmark, the tags annotated by them are added to their per- 
sonomy. 

Some consider this usage grammatically incorrect in English. However, a person's 
gender is explicit in the third person singular pronouns, and there is no perfect 
solution to this issue. Sometimes, the wording he/she is used, but using it 
throughout this work would become cumbersome, and would harm its readability. 
We rely on tips by the Oxford English Dictionary for this decision 12 . 

12 http://www.oxforddictionaries.com/page/heshethey/he-or-she-versus-they 
"If you have an apple and I have an apple and we exchange these apples then you and I 
will still each have one apple. But if you have an idea and I have an idea and we exchange 
these ideas, then each of us will have two ideas. " 

— George Bernard Shaw 



This chapter introduces the previous work we found in the literature. 
Specifically, the works in the research areas related to this work are put 
together, summarized, and contextualized. In Section 2.1 we define and provide 
background on the resource classification problem. In Section 2.2 we summarize 
the previous efforts towards an SVM approach that enables the classification of 
resources within an environment where the taxonomy is multiclass (i.e., made up 
of more than two classes), and the number of labeled resources tends to be tiny 
as compared to the unlabeled ones. In Section 2.3 we summarize the works in 
which annotations from social tagging systems have been exploited to enhance 
information search, management and access. Specifically, we first summarize in 
Subsection 2.3.1 the use of social annotations to enhance information 
management tasks. Then, we present in detail in Subsection 2.3.2 the works 
regarding the use of social annotations for classification. The latter is the 
most important topic for this thesis, but due to the novelty of the research 
field and the lack of work on it, it is shorter than the former. 
2.1 Resource Classification 

Resource classification is the task of assigning categories from a predefined 
taxonomy to a set of resources. Formally, it consists of associating a Boolean 
value to each pair $(r_j, c_i) \in R \times C$, where 
$R = \{r_1, \ldots, r_{|R|}\}$ is the set of resources, and 
$C = \{c_1, \ldots, c_{|C|}\}$ is the set of predefined categories. The goal is 
for the classifier to give predictions by means of the function 
$f : R \times C \rightarrow \{T, F\}$, in such a way that it resembles as 
closely as possible the function $\phi : R \times C \rightarrow \{T, F\}$, 
which defines the ideal classification of the resources (Sebastiani, 2002). 
On top of this definition, several settings give rise to different 
classification approaches. Next, we briefly define the main settings. 

Usually, a classification task comprises two subsets of resources when it relies 
on a machine learning approach. Some of the resources are already labeled with 
corresponding categories, and others are unlabeled. The former are used by the 
classifier to learn the characteristics of each category, creating a model for each 
category after the learning process. The latter are the instances to be predicted 
by the classifier. Relying on the models created during the learning phase, the 
classifier provides a category for each unlabeled resource as its prediction. 
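
The formal definition above can be illustrated with a minimal Python sketch. The resources, categories and predictions below are hypothetical toy values, and the `agreement` measure is only one simple way of quantifying how closely the classifier function resembles the ideal one; it is not the evaluation measure used in this thesis.

```python
from itertools import product

# Hypothetical toy data: three resources and two categories.
resources = ["r1", "r2", "r3"]
categories = ["sports", "science"]

# Ideal classification: the (resource, category) pairs that are True.
ideal_true = {("r1", "sports"), ("r2", "science"), ("r3", "science")}

# Classifier output: the pairs that the classifier predicts as True.
predicted_true = {("r1", "sports"), ("r2", "science"), ("r3", "sports")}


def ideal(resource, category):
    """Ideal function: True when the pair is a correct assignment."""
    return (resource, category) in ideal_true


def f(resource, category):
    """Classifier function: True when the classifier assigns the category."""
    return (resource, category) in predicted_true


def agreement(resources, categories):
    """Fraction of pairs in R x C on which f matches the ideal function."""
    pairs = list(product(resources, categories))
    matches = sum(f(r, c) == ideal(r, c) for r, c in pairs)
    return matches / len(pairs)


print(agreement(resources, categories))
```

In this toy case the classifier misplaces only `r3`, which produces two disagreeing pairs out of six.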

2.1.1 Binary and Multiclass Classification 

Depending on the taxonomy of predefined categories into which the resources 
are organized, the task is said to be either binary or multiclass. Even though 
the only apparent difference is the number of classes making up the taxonomy 
(2 for binary, and 3 or more for multiclass), the two tasks tend to pursue 
different goals. 

A binary classification is usually part of a filtering process, where the classes 
are the positive and the negative case. These tasks aim at separating the resources 
of interest from the resources to be ruled out. For instance, a common binary 
classifier used as a filter is an email application that keeps the interesting 
messages in the inbox, whereas it sends the unwanted ones to the spam folder. 

A multiclass classification involves a larger taxonomy, and is usually used for 
thematic classification, i.e., where the categories represent the aboutness of the 
underlying resources. This kind of classification makes it possible to organize 
resources into groups of related matters, and it has several applications such as 
creating directories of resources to ease later browsing, providing customized 
suggestions according to users' topics of interest, or handling resources from 
different categories separately, among many others. 

In this thesis, we deal with thematic classification and, thus, we consider the 
task to be multiclass. 

2.1.2 Single-label vs Multilabel Classification 

The number of categories or labels that can be assigned to each resource is another 
setting that describes a resource classification. The number of labels for a single 
resource can be constrained to just one, thus becoming a single-label task, or 
it can be extended to allow several or even an unlimited number of labels, in which 
case it is called multilabel. In practice, it determines whether a resource can be 
related to several categories, or can be included in just one. Besides the 
classification task itself, this feature also affects the subsequent organization 
and browsing of categories. 

In this thesis, we focus on single-label classification, mainly because the tax- 
onomies we use as the ground truth provide this kind of categorization data. 

2.1.3 Semi-supervised vs Supervised Classification 

Learning methods differ in the instances that the classifier considers to learn 
and create the model. A supervised learning method learns from the instances in 
the training set, and creates a model from them. A semi-supervised method goes 
further by also considering unlabeled instances in the learning process. After 
creating the model from labeled instances, it includes its predictions on the 
unlabeled instances in the learning process, enabling an incremental evolution of 
the model. The latter is especially useful when the training set is small, and 
the lack of sufficient learning data makes it worthwhile to enlarge the labeled 
data. 
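
The difference between the two learning methods can be sketched as follows. This is a minimal, hedged illustration: it uses a nearest-centroid classifier as a simple stand-in (not an SVM), a basic self-training loop as one possible semi-supervised strategy, and hypothetical two-dimensional toy data. Calling `fit` alone corresponds to the supervised case; `self_train` additionally learns from its own predictions on the unlabeled instances.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(labeled):
    """Supervised step: build one centroid per category from (vector, label) pairs."""
    by_label = {}
    for vec, label in labeled:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(model, vec):
    """Return the label of the nearest centroid."""
    return min(model, key=lambda label: sq_dist(model[label], vec))


def self_train(labeled, unlabeled, rounds=3):
    """Semi-supervised loop: refit on labeled data plus the model's own predictions."""
    model = fit(labeled)
    for _ in range(rounds):
        pseudo = [(vec, predict(model, vec)) for vec in unlabeled]
        model = fit(labeled + pseudo)
    return model


# Hypothetical toy data: two labeled instances and four unlabeled ones.
labeled = [([0.0, 0.0], "a"), ([4.0, 4.0], "b")]
unlabeled = [[0.5, 0.2], [0.1, 0.6], [3.6, 4.2], [4.4, 3.9]]
model = self_train(labeled, unlabeled)
print(predict(model, [1.0, 1.0]), predict(model, [3.0, 3.5]))
```

Note that any wrong prediction on the unlabeled instances is fed back into the model, which is the source of noise discussed later for semi-supervised approaches.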

Based upon these two learning methods, we summarize the related work on 
the use of SVM classifiers in the next section. We rely on SVM as a state-of-the-art 
classification algorithm widely used in the field. 

2.2 Support Vector Machines for Classification 

In the last decade, SVM has become one of the most widely studied techniques 
for text classification, due to the positive results it has shown. This technique uses 
the vector space model to represent the resources, and assumes that resources in 
the same class should fall into separable regions of the representation space. 
Accordingly, it looks for a hyperplane that separates the classes, chosen so that 
it maximizes the distance between the hyperplane and the nearest resources, which 
is called the margin. Equation 2.1 defines such a hyperplane (see Figure 2.1 on the 
following page). 

$$f(x) = w \cdot x + b \qquad (2.1)$$

Figure 2.1: An example of binary SVM classification, separating two classes (black 
dots from white dots). Source: Wikimedia Commons. 

Solving this function directly, though, would require considering all the possible 
values and then selecting the values of $w$ and $b$ that maximize the margin, which 
would be computationally expensive. The equivalent formulation in Equation 2.2 is 
thus used to relax it (Boser et al., 1992; Cortes and Vapnik, 1995): 

$$\min \left[ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i^d \right] \qquad (2.2)$$

Subject to: 

$$y_i (w \cdot x_i + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0$$

where $C$ is the penalty parameter, $\xi_i$ is a slack variable for the $i$-th resource, 
$l$ is the number of labeled resources, and $d$ is the sigma parameter which defines 
the non-linear mapping from the input space to some high-dimensional feature 
space. 
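
As a hedged illustration of Equations 2.1 and 2.2, the following sketch evaluates the decision function and the relaxed objective for a fixed hyperplane. It assumes the standard soft-margin formulation of Cortes and Vapnik (1995), with labels in {-1, +1} and the slack of each resource taken as the smallest value satisfying its constraint. The toy data and the values of `w`, `b`, `C` and `d` are hypothetical, and the sketch only evaluates the objective; it does not perform the optimization.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision(w, b, x):
    """f(x) = w . x + b; its sign gives the side of the hyperplane."""
    return dot(w, x) + b


def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (w . x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * decision(w, b, x))


def objective(w, b, data, C=1.0, d=1):
    """0.5 * ||w||^2 + C * sum of xi_i ** d over the labeled (x, y) pairs."""
    penalty = sum(slack(w, b, x, y) ** d for x, y in data)
    return 0.5 * dot(w, w) + C * penalty


# Hypothetical labeled resources; the third one lies inside the margin.
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], 1)]
w, b = [1.0, 0.0], 0.0
print(objective(w, b, data))
```

Only the third resource needs a non-zero slack here, so the penalty term grows with `C` while the first term depends solely on the norm of `w`.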

When the value of d is set to 1, this function can only solve linearly separable 
problems. The use of a kernel function is sometimes required to remap the space. 
This remapping creates a new space with a higher number of dimensions, which 
enables a linear separation. After that, the remapping is undone, so the 
hyperplane is transformed back to the original space, respecting the 
classification function. The best-known kernel functions include linear, 
polynomial, radial basis function (RBF) and sigmoid, among others. The performance 
of different kernel functions has been studied in Schölkopf et al. (1999) and 
Kivinen and Williamson (2002). The linear kernel is the most widely used for text 
classification. 
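
For reference, the four kernel functions named above can be sketched in their commonly used textbook forms. The parameter names (`degree`, `coef0`, `gamma`) and their default values are assumptions chosen for illustration; they are not settings taken from this thesis.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return dot(x, z)


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, gamma=0.5, coef0=0.0):
    """K(x, z) = tanh(gamma * x . z + coef0)"""
    return math.tanh(gamma * dot(x, z) + coef0)


x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```

Each function returns a similarity value computed as if the vectors had been mapped to another space, without building that space explicitly.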

Note that the function above can only solve binary and supervised problems, 
so different variants are necessary to handle semi-supervised or multiclass 
tasks. 

2.2.1 Semi-supervised Learning for SVM (S³VM) 

Semi-supervised learning approaches differ in the way the classifier learns. 
As opposed to supervised approaches, unlabeled data is used during the 
learning phase. Taking unlabeled data into account can help improve 
the performance of supervised classifiers, especially when the predictions on it 
provide new useful information, as shown in Figure 2.2. However, the noise added by 
incorrect predictions can worsen the learned model and, therefore, the 
performance of the classifier. This makes it worthwhile to study whether relying on 
semi-supervised approaches is suitable for a certain kind of task. 

Semi-supervised learning for SVM, also known as S³VM, was first introduced 
by Joachims (1999) in a transductive way, by modifying the original SVM function. 
To do so, the author proposed adding a term to the optimization function (see 
Equation 2.3). 

$$\min \left[ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i^d + C^* \sum_{j=1}^{u} {\xi_j^*}^d \right] \qquad (2.3)$$

where $u$ is the number of unlabeled data, and the parameters with an asterisk 
(*) refer to the unlabeled instances included in the learning phase. 

Nevertheless, the adaptation of SVM to semi-supervised learning significantly 
increases its computational cost, due to the non-convex nature of the resulting 
function, which makes finding the minimum value even more complicated. In order 
to relax the function, convex optimization techniques such as semi-definite 
programming are commonly used (Xu et al., 2008), which make minimizing the 
function much easier. 

By means of this approach, Joachims (1999) demonstrated a large performance 
gap between the original supervised SVM and his proposed semi-supervised SVM, 
in favor of the latter. He showed that, for binary classification tasks, the 
smaller the training set, the larger the difference between the two approaches. 
He used the Reuters-21578, Ohsumed and WebKB datasets for that purpose. Although 
he worked with multiclass datasets, he split the problems into smaller binary 
ones, and so he did not demonstrate whether the same performance gap occurs for 
multiclass classification. More recently, Chapelle et al. (2008) presented a 
comprehensive review of advances in binary S³VM approaches. 

Figure 2.2: SVM vs S³VM, where black and grey dots are labeled resources, and 
white dots are unlabeled resources. It can be seen that the few labeled resources 
give rise to a certain separation (grey line, SVM), whereas including unlabeled 
ones helps infer a more accurate separation (black line, S³VM). 



2.2.2 Multiclass SVM 

Due to the dichotomous nature of SVM, the need arose for new methods to solve 
multiclass problems, where more than two classes must be considered. Different 
approaches have been proposed to achieve this. On the one hand, as a native 
approach, Weston and Watkins (1999) proposed modifying the optimization function 
to take all the k classes into account at once (see Equation 2.4). 

The main novelty of this approach, as compared to previous ones, is that apart 
from the choice of a kernel, it is parameterless. Their experiments on benchmark 
datasets from the UCI repository show results similar to those of SVMs that have 
been tuned with the best choice of parameters. 

On the other hand, the original binary SVM classifier has usually been combined 
to obtain a multiclass solution. Among the combinations of binary SVM classifiers, 
two different approaches to k-class classification stand out (Hsu and Lin, 
2002): 

• one-against-all constructs k classifiers defining that many hyperplanes, each 
of them separating class i from the remaining k-1. For instance, for a problem 
with 4 classes, the following classifiers would be created: 1 vs 2-3-4, 
2 vs 1-3-4, 3 vs 1-2-4 and 4 vs 1-2-3. Unlabeled resources will be categorized 
in the class of the classifier that maximizes the margin: 
$\hat{C} = \arg\max_{i=1,\ldots,k} (w_i \cdot x + b_i)$. As the number of classes 
increases, the number of classifiers increases linearly. 

• one-against-one constructs $k(k-1)/2$ classifiers, one for each possible category 
pair. For instance, for a problem with 4 classes, the following classifiers 
would be created: 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4 and 3 vs 4. After that, 
it classifies each new document by using all the classifiers, where a vote is 
added for the winning class of each classifier; the method will propose 
the class with the most votes as the result. As the number of classes increases, 
the number of classifiers increases quadratically, and so the 
problem could become very expensive for large taxonomies. 
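
The two combination schemes listed above can be sketched as follows. This is a minimal illustration of how the binary tasks are enumerated and how their outputs are combined; the decision values and pairwise winners are hypothetical numbers standing in for the outputs of trained binary SVM classifiers.

```python
from collections import Counter
from itertools import combinations


def one_against_all_tasks(classes):
    """One binary task per class: that class versus all the others (k tasks)."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def one_against_one_tasks(classes):
    """One binary task per unordered pair of classes: k * (k - 1) / 2 tasks."""
    return list(combinations(classes, 2))


def predict_one_against_all(scores):
    """Pick the class whose classifier yields the largest decision value."""
    return max(scores, key=scores.get)


def predict_one_against_one(pair_winners):
    """Pick the class winning the most pairwise contests (ties broken arbitrarily)."""
    votes = Counter(pair_winners.values())
    return votes.most_common(1)[0][0]


classes = [1, 2, 3, 4]
print(len(one_against_all_tasks(classes)), len(one_against_one_tasks(classes)))

# Hypothetical outputs of already trained binary classifiers for one resource.
scores = {1: -0.3, 2: 0.8, 3: -1.1, 4: 0.1}
pair_winners = {(1, 2): 2, (1, 3): 1, (1, 4): 1, (2, 3): 2, (2, 4): 2, (3, 4): 4}
print(predict_one_against_all(scores), predict_one_against_one(pair_winners))
```

With 4 classes this gives 4 and 6 binary tasks respectively; with 20 classes it would give 20 and 190, which shows how much faster the pairwise scheme grows.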

Both Weston and Watkins (1999) and Hsu and Lin (2002) compare the native 
multiclass approach to the one-against-one and one-against-all approaches that 
combine binary classifiers. They agree in concluding that the native approach does 
not outperform one-against-one or one-against-all, although it considerably 
reduces the computational cost because the number of support vector 
machines it defines is smaller. Among the approaches that combine binary 
classifiers, they show the performance of one-against-one to be superior to that 
of one-against-all. 

Although these approaches have been widely used in supervised learning 
environments, they have scarcely been applied to semi-supervised learning. 
Accordingly, we believe that a study of their applicability and performance for 
this type of problem is necessary before proceeding with additional experiments. 

2.2.3 Multiclass S³VM 

When the taxonomy is defined by more than two classes and the number of 
previously labeled documents is very small, the combination of both multiclass and 
semi-supervised approaches could be required, i.e., a multiclass S³VM approach. 
A common web page classification problem meets these characteristics: it has a 
taxonomy of more than two categories, and it could be helpful to enlarge the 
tiny amount of labeled documents by including predictions on unlabeled data 
in the learning phase. 

However, little work has been done on transforming SVM into an approach that is 
both semi-supervised and multiclass, and especially on comparing such approaches 
to others. As a native approach, Yajima and Kuo (2006) modified the original 
SVM function by fitting it to multiclass semi-supervised tasks (see Equation 2.5). 

$$\min \; \frac{1}{2} \sum_{i=1}^{h} {\beta^{i}}^{T} K^{-1} \beta^{i} + C \sum_{j=1}^{l} \sum_{i \neq y_j} \max\{0,\, 1 - (\beta_j^{y_j} - \beta_j^{i})\}^2 \qquad (2.5)$$

where $\beta$ represents the product of a vector of variables and a kernel matrix 
defined by the authors. 

The authors showed that the proposed approach outperformed other non-SVM 
algorithms, but they did not show whether it was better than other SVM settings. 
As far as we know, the software was not made publicly available, and no further 
work has been done using this approach. 

Chapelle et al. (2006) present another native multiclass S³VM approach by using 
the Continuation Method. This is the only work, to the best of our knowledge, 
where one-against-all and one-against-one approaches have been tested in a 
semi-supervised environment. They apply these methods to news datasets, yielding 
worse performance. Moreover, they show that one-against-one is not sufficient for 
real-world multiclass semi-supervised learning, since the unlabeled data cannot 
be restricted to the two classes under consideration. 

On the other hand, others relied on combining SVM with other algorithms in 
search of a multiclass semi-supervised SVM approach. Qi et al. (2004) use Fuzzy 
C-Means (FCM) to predict labels on unlabeled resources. After that, multiclass 
SVM is used to learn with the augmented training set, classifying the test set. Xu 
and Schuurmans (2005) rely on a clustering-based approach to label the unlabeled 
data. Afterwards, they apply a multiclass SVM classifier to the labeled training 
set. 

It is worth noting that most of the above works introduced their approaches 
and only compared them to other semi-supervised classification methods, 
such as Expectation-Maximization (EM) or Naive Bayes. As an exception, 
Chapelle et al. (2006) compared a semi-supervised and a supervised SVM 
approach, but only over image datasets. In this thesis, we do not aim at proposing 
new SVM approaches. However, we believe that evaluating and comparing 
multiclass SVM and multiclass S³VM approaches is necessary to settle on a 
suitable approach. This would help discover whether learning from unlabeled 
resources is helpful for multiclass problems when using SVM as a classifier. 

2.3 Benefiting from Social Annotations 

Since they were introduced along with the Web 2.0 phenomenon in the early 2000s, 
social annotations have gained popularity and interest with the creation of 
well-known social tagging sites like Delicious. This section summarizes how social 
annotations have helped improve information access and management. Among 
the existing works, we focus especially on their use for classification tasks. 

Social tagging systems arose as an idea of Joshua Schachter, founder of 
Delicious (Smith, 2008). In the late 1990s, after the bookmarks in his web browser 
had overflowed, he used to save his favorite URLs in a text file, with one entry 
per line. Each entry was a URL, followed by a set of tags. These tags enabled him 
to easily find again the URL he was looking for: he just had to filter by keyword. 
He also published the list online at Muxway.org (currently discontinued and not 
accessible). Later, in September 2003, he released Delicious, the first 
online tool that enabled saving and tagging URLs as he used to, but enhanced by 
a social environment. This social tool enabled users to search among saved URLs, 
not only by their own tags, but also by taking advantage of others'. 

Research on social tagging systems did not arise until 2006. An early 
work by Golder and Huberman (2006) studied the characteristics 
of Delicious, and was followed by an increasing interest among researchers that 
gave rise to a large number of research works in the field. Next, we focus on some 
of the most relevant advances in the use of social tags for information 
management, and go into more depth on the specific task of resource classification. 

2.3.1 Social Annotations for Information Management 

Social annotations have been widely used for information management tasks. 
They have proven very useful for several tasks in which the availability 
of data is of utmost importance (Gupta et al., 2010): 

• Search: Social tags have been successfully applied to web search. Bao et al. 
(2007) found that social annotations can enhance web search (1) as good 
summaries of the corresponding web pages, and (2) as a way to compute the 
popularity of web pages by considering the number of users who annotate 
them. Heymann et al. (2008) analyzed the usefulness of tags from Delicious 
for web search, and concluded that these metadata can provide additional 
and meaningful data not available in other sources, though they may 
currently lack the size to have a significant impact. Also, Dmitriev et al. (2006) 
showed the usefulness of social annotations for improving the quality of 
intranet search. As a specific approach for searching on social tagging 
systems, Hotho et al. (2006) proposed FolkRank, a search algorithm that fits 
the structure of folksonomies. They found this approach useful for providing 
personalized rankings of the resources in a folksonomy, as well as for 
finding communities of users within these systems. 

• Recommender Systems: Shepitsen et al. (2008) and Li et al. (2008) introduced 
recommendation algorithms based on user-generated tags. They 
show that social annotations are effective to discover user interests and, 
therefore, to recommend new resources to them. In Bogers and van den Bosch 
(2008), the authors take advantage of annotations provided by users on a 
social reference manager for recommending research papers to scientists. 
In Cantador et al. (2011), the authors present a mechanism to automatically 
filter and classify social tags into a set of purpose-oriented categories, so that 
they can rely on suitable tags to recommend resources to users. 

• Enhanced Browsing: Social annotations can be helpful to improve the 
navigation of resources as well. Smith (2008) describes three new ways of 
navigating that emerged from folksonomies: pivot browsing (moving through an 
information space by choosing a reference point to browse), popularity-driven 
navigation (retrieving the resources that are popular for a given 
tag), and filtering (social tagging allows separating the resources you do 
not want from the resources you do want). In a preliminary study, we 
integrated tags from Delicious into Wikipedia (Zubiaga, 2009). Tags provided 
new data that was not available in the content of encyclopedia articles, 
providing a means to enhance navigation and search on the site. 

• Clustering and Classification: Social tags have also proven useful for 
resource organization tasks, including clustering and classification. This 
point is explained in more depth in the next section. 

2.3.2 Social Annotations for Classification 

Before we began working on this thesis, there was little work dealing with the 
analysis of the applicability and usefulness of social tags for resource 
classification tasks. Most of it had focused on classifying web pages, and had just 
explored the appropriateness of social tags for this kind of task. However, none 
of these works performed real classification experiments, but just statistical 
analyses. Accordingly, they did not further explore how to get the most out of 
social tags in order to improve the performance. 

Noll and Meinel (2008a) presented a study of the characteristics of social 
annotations provided by end users, in order to determine their usefulness for web 
page classification. In this work, the authors weight the tags by normalizing the 
number of users annotating them. The least popular tag is given a value of 0, 
whereas the most popular is given a value of 1. In this way, they discard the 
least popular tags as if they were useless. Moreover, they did not examine 
whether or not this representation approach was appropriate to carry out the 
task. The authors matched user-supplied tags of a page against its categorization 
by the expert editors of the Open Directory Project (ODP). They analyzed 
at which hierarchy depth matches occurred, concluding that tags may perform 
better for broad categorization of documents rather than for more specific 
categorization. The study also points out that since users tend to bookmark and 
tag top level web documents, this type of metadata will target classification of 
the entry pages of websites, whereas classification of deeper pages might require 
more direct content analysis. They observed that in the power law curve formed 
by the popularity of social tags, not only popular tags, but also the tags in the tail 
provide helpful data for information retrieval and classification tasks in general. 
In a previous work, the same authors (Noll and Meinel, 2007) suggested that tags 
provide additional information about a web page, which is not directly contained 
within its content. 
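
The weighting described above, where the least popular tag of a resource gets 0 and the most popular gets 1, can be sketched as a min-max normalization. Note that min-max scaling is an assumption about how such a normalization might be computed, and the tag counts are hypothetical; the exact formula used by Noll and Meinel (2008a) is not given here.

```python
def normalize_tag_weights(user_counts):
    """Scale the user counts of a resource's tags linearly into [0, 1]."""
    lowest = min(user_counts.values())
    highest = max(user_counts.values())
    if highest == lowest:
        # All tags are equally popular; treat them all as top tags.
        return {tag: 1.0 for tag in user_counts}
    span = highest - lowest
    return {tag: (count - lowest) / span for tag, count in user_counts.items()}


# Hypothetical tag -> number of annotating users for one web page.
counts = {"python": 50, "programming": 30, "todo": 10}
weights = normalize_tag_weights(counts)
print(weights)
```

The sketch makes the side effect visible: the least popular tag always ends up with weight 0, regardless of how many users actually assigned it.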

Also, Noll and Meinel (2008b) studied three types of metadata about web 
documents: social annotations (tags), anchor texts of incoming hyperlinks, and 
search queries to access them. They concluded that tags are better suited for 
classification purposes than anchor texts or search keywords. 

As regards clustering tasks, Ramage et al. (2009) included tagging data 
to improve the performance of two clustering algorithms as compared to 
content-based clustering. They found that tagging data was more effective for 
specific collections than for a collection of general documents. They weighted the 
tags both by using the number of users annotating them, and by reweighting this 
value with the Inverse Document Frequency (IDF) value of the tag across 
the resources in the collection. They showed that the IDF weighting scheme 
performed better. 
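
A minimal sketch of such an IDF reweighting follows. It assumes the common logarithmic form idf(t) = log(N / df(t)), where N is the number of resources and df(t) the number of resources annotated with tag t; the exact variant used by Ramage et al. (2009) is not specified in this section, and the collection below is hypothetical.

```python
import math


def document_frequency(tag, collection):
    """Number of resources in the collection annotated with the tag."""
    return sum(1 for tags in collection if tag in tags)


def idf(tag, collection):
    """Assumed form: log(N / df); 0 for a tag that appears on every resource."""
    return math.log(len(collection) / document_frequency(tag, collection))


def reweight(resource_tags, collection):
    """Multiply each tag's user count by its IDF value across the collection.

    Assumes the resource itself belongs to the collection, so df >= 1.
    """
    return {tag: count * idf(tag, collection) for tag, count in resource_tags.items()}


# Hypothetical collection: one dict of tag -> user count per resource.
collection = [
    {"web": 12, "svm": 3},
    {"web": 7, "news": 5},
    {"web": 9, "news": 2},
    {"web": 4, "recipes": 6},
]
weights = reweight(collection[0], collection)
print(weights)
```

In this example the ubiquitous tag `web` is weighted down to zero despite its high user count, while the rarer tag `svm` gains relative importance.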

Even though those were the only works published by then, interest in 
this research area has increased lately. After the presentation of our earliest work 
in the field (Zubiaga et al., 2009d), more scientists have shown their interest in 
it, and have presented new works. 

In Aliakbary et al. (2009), the authors integrated social annotations as an 
approach to extending web directories. They relied on the number of users an- 
notating each tag as a weight. Upon that, they created a model vector for each 
category, and computed the cosine similarity to new web pages to generate pre- 
dictions. They observed that the annotations provided a multi-faceted summary 
of the web pages, and that they better represent the aboutness of web pages than 
the content itself. Also, they conclude that the more users annotate a URL, the 
better it is classified. 
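
The prediction step described above can be sketched as follows. How the model vector of each category is built is not detailed here, so summing the tag weights of the pages known to belong to the category is an assumption made for illustration; the categories, tags and counts are hypothetical.

```python
import math
from collections import Counter


def category_vector(pages):
    """Assumed model: sum the tag weights of all pages in the category."""
    total = Counter()
    for tags in pages:
        total.update(tags)
    return total


def cosine(u, v):
    """Cosine similarity between two sparse tag -> weight vectors."""
    dot = sum(weight * v.get(tag, 0) for tag, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(page, models):
    """Predict the category whose model vector is most similar to the page."""
    return max(models, key=lambda category: cosine(page, models[category]))


# Hypothetical training pages: tag -> number of annotating users.
training = {
    "programming": [{"python": 5, "code": 3}, {"code": 4, "java": 2}],
    "cooking": [{"recipe": 6, "food": 4}],
}
models = {category: category_vector(pages) for category, pages in training.items()}
new_page = {"code": 2, "python": 1}
predicted = classify(new_page, models)
print(predicted)
```

Because cosine similarity normalizes by vector length, a category with many annotated pages does not win merely by having larger accumulated weights.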

In another work where social tags were exploited for the benefit of web page 
classification, Godoy and Amandi (2010) also showed their usefulness for the 
task, with tag-based classifiers outperforming classifiers based on the full text 
of documents. Similar to our previous work (Zubiaga et al., 2009d), they compare 
tag-based resource representations relying on all the tags and on the top 10 tags 
for each resource, corroborating our finding that the former performs better. 
Going further, they concluded that stemming the tags reduces the performance 
of such classification, even though some operations such as removal of symbols, 
compound words and reduction of morphological variants have a modest 
positive impact on the task. 

Xia et al. (2010) studied the usefulness of social tags as a complementary 
source for improving the classification of academic conferences into 
corresponding topics. Using tagging data gathered from WikiCFP¹, and weighting 
the tags according to the number of users annotating them, they compare the 
classification of conferences by using only the content of the call for papers, 
and by integrating tagging data along with it. Their experiments yielded slightly 
better performance for the integration of social tags, with roughly 1% improvement. 

With regard to the classification of resources other than web pages, Lu et al. 
(2010) present a comparison of tags annotated on books and their Library of 
Congress subject headings. Actually, no classification experiments are performed, 
but a statistical analysis of the tagging data shows encouraging results. By means 
of a shallow analysis of the distribution of tags across the subject headings, they 
conclude that user-generated tags seem to provide an opportunity for libraries to 
enhance the access to their resources. 

Using a graph-based approach, in Yin et al. (2009) the authors present a 
method to classify products from Amazon into their corresponding categories 
using social tags. They conclude that social tags can enhance web product 
classification by representing products in a meaningful feature space, 
interconnecting them to indicate relationships, and bridging heterogeneous 
products so that category information can be propagated from one domain to another. 

There is also a set of works dealing with user behavior in social tagging sys- 
tems. Even though they do not perform classification experiments, they suggest 
the existence of a subset of users in these systems who are rather categorizing 
the resources when annotating. Specifically, early works such as Hammond et al. 
(2005b) and Marlow et al. (2006a) suggest the existence of two types of users: on 
one hand, users can be motivated by categorization (so-called Categorizers). These 
users view tagging as a means to categorize resources according to some (shared 
or personal) high-level conceptualizations. On the other hand, users who are mo- 
tivated by description (so-called Describers) view tagging as a means to accurately 
and precisely define the content of resources. These proposals have been further 
studied in Körner et al. (2010a) and Körner et al. (2010b). However, they focused 
on showing that the difference in motivation between those two kinds of users 
actually exists, and they did not pay attention to whether Categorizers are better 
suited to the classification task. 

Also, there has recently been an increasing interest in using social tags for 
the benefit of clustering tasks. In Lu et al. (2009) the authors not only cluster the 
annotated resources, but also users and tags. Zhang et al. (2009) found that the 
effectiveness of clustering blog posts using tags from a simple tagging system was 
quite limited, and they combined these data with relations in the blogosphere to 
get better results. 

¹ http://www.wikicfp.com 

So far, there is little work on the analysis of folksonomies for the classification 
task. A few works have shown their suitability for this purpose, but no special 
attention has been paid to further studying these metadata structures. In this 
thesis, we aim at analyzing these structures to find an approach to amalgamate 
and represent the tags in order to perform a resource classification task with 
high accuracy. 

3 Support Vector Machines for Large-Scale Classification 



Science is the systematic classification of experience. 
— George Henry Lewes 



We study the appropriateness of several SVM approaches for 
large-scale classification in this chapter. We do not introduce 
any new approaches; rather, we perform a preliminary 
comparison among different approaches, in search of a 
suitable one for large-scale classification tasks. We also 
evaluate the real contribution of unlabeled data to multiclass 
SVM-based classification tasks. Specifically, we compare a 
native multiclass approach to the combination of binary classifiers, 
as well as a supervised to a semi-supervised approach. We carry 
out these experiments using three web page datasets. 
The Web is a good example of the problem we are dealing with, 
where the number of resources is very large, and the number of 
labeled ones tends to be tiny compared to the whole 
collection. The chapter is organized as follows. Next, in Section 3.1 
on the next page we define and present the features of a 
large-scale classification task. We enumerate and describe the SVM 
approaches compared in our study in Section 3.2 on the 
following page. Then, in Section 3.3 on page 51 we detail the settings of 
the experiments, showing their results in Section 3.4 on page 53. 
Finally, we discuss the results in Section 3.5 on page 55, and 
conclude and answer the following research questions in Section 3.6 
on page 56: 

Research Question 1 

What kind of SVM classifier should be used to perform this 
kind of classification task: a native multiclass classifier, or a 
combination of binary classifiers? 

Research Question 2 

What kind of learning method performs better for this kind of 
classification task: a supervised one, or a semi-supervised one? 

3.1 Definition of Large-Scale Classification 

The Web comprises many collections of web documents and other resources that 
grow constantly. The increasing amount of resources on the Web has in 
part influenced the recent growth of research on large-scale datasets. Accordingly, 
extending earlier studies on SVM classification to large-scale collections gains 
importance. 

In this regard, we believe that a thematic classification task commonly meets 
the following conditions when it comes to large-scale collections of resources: 

• Tininess of the training set: getting a manual classification of a subset as 
a training set is very expensive and entails a lot of time and effort. Thus, 
the previously categorized subset will be tiny as compared to the uncat- 
egorized subset. This suggests considering semi-supervised approaches 
besides supervised ones, as a way of extending the training set. 

• Multiclass taxonomy: a taxonomy is usually composed of more than two 
categories, and thereby it is considered as a multiclass task instead of a 
binary one. 

We assume that the large-scale thematic classification tasks we undertake 
in this thesis meet those two conditions. 

3.2 Compared SVM Approaches 

The two features shown above encourage the study of different SVM settings to 
settle on a suitable one. On the one hand, the tininess of the training set 
requires analyzing whether or not it is worthwhile relying on a semi-supervised 
approach that extends it instead of a supervised one. An early work by Joachims 
(1999) showed that the former performs better for binary classification, but it 
is not clear whether or not the same happens for multiclass scenarios, especially 
because labeling instances on a larger taxonomy is harder, and it seems much 
likelier to introduce noise in the extension of the training set. On the other hand, 
classification on a multiclass taxonomy can be tackled in two different ways: 
as a single multiclass task, or as several smaller binary tasks. Little work has been 
done comparing these two settings, though. The lack of analyses of the 
appropriateness of the aforementioned methods for the task led us to perform 
such a study prior to working on large-scale classification tasks. 

Our study involves both supervised SVM and semi-supervised SVM (S3VM)
approaches on different multiclass settings. Specifically, we rely on three
multiclass settings, which were introduced in Section 2.2.2: a native multiclass
algorithm, and one-against-one and one-against-all, both based on binary classifiers.
When naming the approaches, we add the suffix -mSVM, -SVM or -S3VM for clarity,
depending on whether they are multiclass, supervised or semi-supervised, respectively.

In order to compare a supervised approach with different levels of
semi-supervision, we created several subsets of labeled and unlabeled instances
within the training sets. While the size of the training set remains fixed, a
smaller subset of labeled instances in it yields a more strongly semi-supervised
setting. The size of the labeled subset ranges from 50 instances to the whole
training set. Figure 3.1 shows how we split training sets into labeled and
unlabeled subsets. Supervised approaches learn from the labeled subset and ignore
the unlabeled one, whereas semi-supervised approaches make predictions on the
latter to enlarge the learning base. For each size of the labeled subset, we
perform 6 different selections of labeled instances, and we report the average
accuracy of the 6 runs in the results.



[Diagram: Training set | Test set; the training set is split into Labeled and
Unlabeled subsets.]

Figure 3.1: Example of splitting a training set into labeled and unlabeled subsets.
The former remains fixed, whereas the size of the latter two changes.
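
The splitting procedure can be sketched as follows. This is only an illustrative sketch: the text does not state how the labeled instances were selected, so uniform random sampling with a per-run seed is an assumption here, and the function names and the `evaluate` callback are hypothetical.

```python
import random


def split_labeled_unlabeled(training_ids, labeled_size, seed):
    """Pick `labeled_size` training instances as the labeled subset.

    Assumption: instances are sampled uniformly at random.
    The remaining training instances form the unlabeled subset.
    """
    if not 0 < labeled_size <= len(training_ids):
        raise ValueError("labeled_size must be between 1 and the training set size")
    rng = random.Random(seed)
    shuffled = list(training_ids)
    rng.shuffle(shuffled)
    return shuffled[:labeled_size], shuffled[labeled_size:]


def average_accuracy(training_ids, labeled_size, evaluate, runs=6):
    """Average `evaluate(labeled, unlabeled)` over `runs` different selections."""
    scores = []
    for run in range(runs):
        labeled, unlabeled = split_labeled_unlabeled(training_ids, labeled_size, seed=run)
        scores.append(evaluate(labeled, unlabeled))
    return sum(scores) / len(scores)
```

Here `evaluate` stands for whatever trains a classifier on the given split and returns its accuracy on the fixed test set; a supervised approach would simply ignore its second argument.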



3.2.1 Native Multiclass Approaches 

Native multiclass approaches treat the classification as a single task
performed by only one classifier. They can be implemented on either a supervised
or a semi-supervised basis. However, little work has been done on developing
native semi-supervised approaches. The only such algorithm we know of was
presented by Yajima and Kuo (2006) and, as far as we know, has not been used in
later works. The algorithm is not available to the community, and implementing
it does not seem feasible because its description does not provide enough
details to reproduce it. Thus, we propose a semi-supervised method that follows
an approach similar to those by Qi et al. (2004) and Xu and Schuurmans (2005).
They perform a two-step classification task, extending the training set with a
clustering algorithm in the first step and then running a supervised SVM on the
extended training set. Our approach differs from those two in that we use the
same algorithm in both steps. With an approach we call 2-steps-mSVM, we extend
the training set using a supervised SVM, i.e., learning from the labeled subset
and labeling the unlabeled subset according to the classifier's decisions. We
then run the same algorithm on the extended set. As a supervised method, we use
a native multiclass SVM, which we call 1-step-mSVM.
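
The two-step idea can be illustrated with a small sketch. Note that the learner below is not an SVM: a nearest-centroid classifier is swapped in purely to keep the example self-contained, and all function names are hypothetical. Only the control flow (train, label the unlabeled subset, retrain on the extended set) mirrors the description above.

```python
def train_centroids(vectors, labels):
    """Stand-in supervised multiclass learner (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for d, value in enumerate(vec):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, vec):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda label: dist(model[label]))


def two_steps(labeled_vecs, labeled_labels, unlabeled_vecs):
    """Two-step learning: extend the training set with predictions, then retrain."""
    # Step 1: learn from the labeled subset and label the unlabeled subset.
    first_model = train_centroids(labeled_vecs, labeled_labels)
    predicted = [predict(first_model, vec) for vec in unlabeled_vecs]
    # Step 2: run the same algorithm on the extended training set.
    return train_centroids(labeled_vecs + unlabeled_vecs, labeled_labels + predicted)
```

Any mistakes made in step 1 are carried into step 2 as if they were true labels, which is where noise can enter the extended training set.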

3.2.2 One-Against-All Approaches

One-against-all is a method that splits the multiclass task into smaller binary
problems. Specifically, it creates k classifiers defining that many hyperplanes;
each of them separates a class i from the remaining k-1 classes. Thus, the number
of classifiers is the same as the number of classes. In the test phase, every
classifier provides a margin for each instance, indicating whether it belongs to
the positive class (class i) or the negative class (the remaining k-1 classes).
Putting together the outputs of all classifiers for an instance (i.e., the
margins provided by the classifiers), the class with the largest positive value
is selected as the system's decision (see Equation 3.1).

\hat{C} = \arg\max_{i = 1, \ldots, k} \left( w_i x + b_i \right)    (3.1)

We implemented this approach with a supervised binary SVM
(one-against-all-SVM) and a semi-supervised binary SVM (one-against-all-S3VM).
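
As an illustration of the decision rule in Equation 3.1, the following sketch picks the class whose binary classifier yields the largest margin. It assumes linear classifiers given as (weights, bias) pairs; the names and the toy classes are hypothetical, and the sketch returns the largest margin even when none is positive.

```python
def margin(weights, bias, x):
    """Signed output of one linear binary classifier: w . x + b."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def one_against_all_predict(classifiers, x):
    """Return the class whose classifier gives the largest margin for x.

    `classifiers` maps each class to a (weights, bias) pair, one per class.
    """
    def class_margin(label):
        weights, bias = classifiers[label]
        return margin(weights, bias, x)
    return max(classifiers, key=class_margin)


# Toy example with three classes over two features.
toy = {
    "sports": ([1.0, 0.0], -0.5),
    "science": ([0.0, 1.0], -0.5),
    "banks": ([-1.0, -1.0], 0.0),
}
print(one_against_all_predict(toy, [2.0, 0.1]))
```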

3.2.3 One-Against-One Approaches 

One-against-one is another method that divides a multiclass problem into smaller
binary ones. Unlike the above method, it creates a binary classifier for each
possible pair among the k categories, which produces k(k-1)/2 1-vs-1 classifiers.
Again, every classifier gives a margin for each instance in the test phase, but
the way of combining the outputs changes in this case. Counting a positive vote
each time a class beats the other in a binary classifier, the system predicts
the class with the most positive votes.
This method has two major problems, though:

1. As the number of classes increases, the number of classifiers grows
quadratically, and so the problem could become very expensive for large
taxonomies.

2. During the test phase, a 1-vs-1 classifier is unable to ignore those instances
that actually belong to neither of the two classes it considers. Thus, including
all the instances in the test phase is the only solution, and given that every
binary classifier provides a margin for every instance, the test phase can
become noisy. This issue was also pointed out by Chapelle et al. (2006).

As with the one-against-all approaches, both a supervised binary SVM and a
semi-supervised binary SVM were used to implement two different settings of
this approach: one-against-one-SVM and one-against-one-S3VM.
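
The voting scheme and the number of pairwise classifiers can be sketched as follows. The `pairwise_margin` callback is a hypothetical stand-in for the trained binary classifiers, and breaking ties in favor of the class listed first is an assumption, since the text does not say how ties are resolved.

```python
from itertools import combinations


def pair_count(k):
    """Number of 1-vs-1 classifiers needed for k classes: k(k-1)/2."""
    return k * (k - 1) // 2


def one_against_one_predict(classes, pairwise_margin, x):
    """Predict by majority vote over all pairwise classifiers.

    `pairwise_margin(a, b, x)` is assumed to be positive when the classifier
    for the pair (a, b) prefers class a, and non-positive when it prefers b.
    """
    votes = {label: 0 for label in classes}
    for a, b in combinations(classes, 2):
        winner = a if pairwise_margin(a, b, x) > 0 else b
        votes[winner] += 1
    # Ties are broken in favor of the class listed first (an assumption).
    return max(classes, key=lambda label: votes[label])


print(pair_count(10), pair_count(6))
```

Every pairwise classifier casts a vote even when the instance belongs to neither class of the pair, which is the source of noise described in the second problem above.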

3.3 Experiment Settings 

This section introduces the datasets we have used to compare the different SVM 
approaches, as well as other settings. 

3.3.1 Datasets 

In order to perform the experimentation on a multiclass scenario, we looked for 
suitable datasets. As benchmark datasets that have been used several times for 
research on classification, we chose the following: 

• BankSearch (Sinka and Corne, 2002), a collection of 11,000 web pages over
11 classes, with very different topics: commercial banks, building societies,
insurance agencies, java, c, visual basic, astronomy, biology, soccer,
motorsports and sports. We removed the category sports, since it includes both
soccer and motorsports, and it is not at the same level as the rest of the
categories. This results in 10,000 web pages over 10 categories. 4,000
instances were assigned to the training set, while the other 6,000 were left
for the test set.

• WebKB 1, with a total of 4,518 documents from 4 universities, classified
into 7 classes (student, faculty, personal, department, course, project and
other). The class named other was removed due to its ambiguity, which left
6 classes. 2,000 instances fell into the training set, and 2,518 into
the test set.

• Yahoo! Science (Tan et al., 2002), with 788 scientific documents classified into
6 classes (agriculture, biology, earth science, math, chemistry and others).
We selected 200 documents for the training set, and 588 for the test set.

1 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

Even though these cannot quite be considered large-scale datasets, the fact
that the selected training sets are small compared to the whole collections makes
the problem resemble a large-scale one. The number of instances selected for each
training set above depends on the number of classes and the size of each dataset.

3.3.2 Document Representation 

SVM relies on a Vector Space Model (VSM) and therefore requires a vectorial
representation of the documents as input for the classifier, in both the training
and test phases. To obtain this vectorial representation, we use the textual
content of the web pages. To this end, we first converted the original HTML code
into plain text strings, removing all the HTML tags. After that, we removed a set
of useless tokens, such as URLs, email addresses and stopwords from a public
list 2. The vectors representing the documents are composed of the remaining
terms, where each dimension corresponds to a term. The weights of these terms in
the vectors are defined by the TF-IDF (Term Frequency - Inverse Document
Frequency) term weighting function (Salton and Buckley, 1988). In order to reduce
the computational cost of the task, we then removed the least frequent terms
according to their document frequency; terms appearing in fewer than 0.5% of the
documents were removed from the representation 3. This process yielded term
vectors with 8285 dimensions for the BankSearch dataset, 3115 for WebKB and 8437
for Yahoo! Science.
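
A minimal sketch of this representation step is given below. It assumes the documents are already tokenized and cleaned, and it uses the plain tf × log(N/df) weighting as one common TF-IDF variant; the text does not specify which variant or normalization was used, so this choice and the function name are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs, min_df_ratio=0.005):
    """Build TF-IDF vectors, dropping terms found in too few documents.

    `docs` is a list of token lists (HTML, URLs, email addresses and
    stopwords are assumed to have been removed beforehand).
    Terms appearing in fewer than `min_df_ratio` of the documents are discarded.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, count in df.items() if count / n_docs >= min_df_ratio)
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for term, tf in Counter(doc).items():
            if term in index:
                # Assumed weighting: raw term frequency times log(N / df).
                vec[index[term]] = tf * math.log(n_docs / df[term])
        vectors.append(vec)
    return vocab, vectors
```

The length of `vocab` corresponds to the number of dimensions of the resulting term vectors.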

3.3.3 Algorithmic Implementation 

The 6 SVM approaches presented above require 3 different classifiers to
construct them: a supervised multiclass one, and two binary ones, one supervised
and one semi-supervised. Taking into account that some SVM implementations are
freely available for research, we looked for well-tested software. Among the
studied alternatives, we opted to use svm-light 4 and its variants, by Thorsten
Joachims (Joachims, 1998). We used supervised svm-light for the
one-against-one-SVM and one-against-all-SVM approaches, whereas
one-against-one-S3VM and one-against-all-S3VM were implemented by using
semi-supervised svm-light. Finally, we used svm-multiclass 5 to implement the
1-step-mSVM and 2-steps-mSVM approaches.

2 http://www.textfixer.com/resources/common-english-words.txt

3 0.5% was a reasonable value for the number of resources we were dealing with.
In any case, this reduction applies to all the algorithms we compare in this
work, and keeping the same reduction for all of them makes their results equally
comparable while reducing the computational cost.

4 http://svmlight.joachims.org

3.3.4 Evaluation Measures 

Most of the results, not only in this chapter but also in the following chapters
of this thesis, come from classification experiments. In order to evaluate their
performance throughout this work, we use accuracy as the evaluation measure. It
has been widely used for text classification tasks, especially when it comes to
multiclass problems. Accuracy gives the proportion of correct predictions within
the whole test set. We consider all the classes in the taxonomies to be equally
relevant to the final performance, so we do not apply any weighting in the
evaluation process. Accordingly, a correct guess adds the same positive value to
the accuracy, regardless of the class it belongs to.

Tables presenting accuracy values in this thesis show a different training set
size in each column, and a different approach or representation method in each
row. The best accuracy values within each table are emphasized in bold.
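
For reference, unweighted accuracy as described above amounts to the following short sketch (the function name is illustrative):

```python
def accuracy(predicted, gold):
    """Proportion of test instances whose predicted class equals the gold class.

    Every instance counts the same, regardless of its class.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```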

3.4 Results 

This section presents the results of the experiments comparing the SVM
approaches. We show the results organized by dataset, and analyze the different
approaches, studying the suitability of multiclass or binary classifiers, as
well as of supervised or semi-supervised learning. Tables 3.1, 3.2 and 3.3 show
those results for the BankSearch, WebKB and Yahoo! Science datasets, respectively.

3.4.1 Native Multiclass vs Combining Binary Classifiers 

Our experiments compare two native multiclass approaches to four methods
combining binary classifiers. First of all, the results show that those relying
on the one-against-one setting (i.e., one-against-one-SVM and
one-against-one-S3VM) generally perform worse than the rest across the datasets;
the only exception is WebKB with the largest labeled subsets, where
one-against-one-SVM is on par with the one-against-all approaches. This
underperformance is consistent with the issue we pointed out above in
Section 3.2.3, i.e., the inability to discriminate the instances that do not
belong to the considered pair of classes adds noise to the decisions.

Among the other two settings, the native multiclass SVMs and the
one-against-all, the former clearly outperform the latter. The performance gap
between those two approaches differs depending on the dataset. Even though it is
much smaller for the WebKB dataset than for the other two, the native multiclass
approach is clearly a better setting for this kind of task. Moreover, for every
size of the labeled subset, the best accuracy is obtained by a native multiclass
setting.

5 http://svmlight.joachims.org/svm_multiclass.html

BankSearch
# of labeled instances    50    100   200   500   1000  2000  3000  4000
1-step-mSVM               .579  .706  .792  .869  .897  .919  .925  .930
2-steps-mSVM              .628  .753  .826  .879  .898  .916  .923  .930
one-against-all-SVM       .372  .485  .575  .697  .759  .816  .843  .855
one-against-all-S3VM      .506  .566  .621  .709  .763  .814  .842  .855
one-against-one-SVM       .311  .443  .549  .679  .744  .803  .826  .840
one-against-one-S3VM      .443  .513  .567  .668  .724  .782  .811  .840

Table 3.1: Accuracy results for the BankSearch dataset.

WebKB
# of labeled instances    50    100   200   500   1000  2000
1-step-mSVM               .600  .677  .739  .787  .810  .822
2-steps-mSVM              .582  .667  .715  .750  .778  .822
one-against-all-SVM       .513  .587  .673  .744  .776  .783
one-against-all-S3VM      .592  .642  .691  .740  .773  .783
one-against-one-SVM       .488  .554  .648  .736  .775  .791
one-against-one-S3VM      .494  .579  .651  .718  .754  .791

Table 3.2: Accuracy results for the WebKB dataset.

Yahoo! Science
# of labeled instances    50    100   200
1-step-mSVM               .682  .825  .908
2-steps-mSVM              .687  .836  .908
one-against-all-SVM       .506  .536  .630
one-against-all-S3VM      .570  .565  .630
one-against-one-SVM       .436  .483  .586
one-against-one-S3VM      .467  .514  .586

Table 3.3: Accuracy results for the Yahoo! Science dataset.

3.4.2 Supervised vs Semi-Supervised Learning 

Besides these three settings, we have also compared supervised and
semi-supervised learning for each of them in our experiments. When comparing
the two analogous approaches for each setting, it can be seen that the
semi-supervised ones (2-steps-mSVM, one-against-all-S3VM and one-against-one-S3VM)
perform better than the supervised ones (1-step-mSVM, one-against-all-SVM and
one-against-one-SVM) in most cases when it comes to the smallest labeled subsets.
However, the contrary happens for larger labeled subsets, where the supervised
approaches perform better. Looking at these results, it seems that the success
of semi-supervised learning for multiclass classification is limited to very small
labeled sets, where more instances are required in order to get a sufficient base
to learn from.

Looking more closely at the native multiclass approaches, which perform
the best, a similar conclusion can be drawn, especially for the largest dataset,
BankSearch. Even though the semi-supervised 2-steps-mSVM performs better than
the supervised 1-step-mSVM for the smallest labeled subsets, the latter performs
slightly better as the labeled subset grows. In the case of WebKB, 1-step-mSVM
is consistently the best, probably because it is harder to correctly predict
the unlabeled instances in the semi-supervised scenario when the taxonomy is
made up of closely related categories, which adds noise to the learning phase.
Finally, for Yahoo! Science, 2-steps-mSVM performs slightly better, but since this
dataset is quite small, it does not let us see whether 1-step-mSVM would
outperform it for larger labeled subsets.

3.5 Discussion 

In this study, we have compared the approaches required to help us determine (a)
whether we should use a native multiclass classifier or combine binary
classifiers, and (b) whether or not including the predictions on unlabeled
instances improves the performance of the classifier. This is not an exhaustive
comparison of SVM approaches for large-scale classification on multiclass
taxonomies. For example, we did not consider any native multiclass
semi-supervised approach like that by Yajima and Kuo (2006), which we did not
have access to, which is the reason why it has not been used subsequently. We
have instead compared a set of approaches available for research purposes.

3.6 Conclusion 

In this chapter, we have analyzed a set of approaches to tackle a large-scale
topical classification task, considering that it fulfills the conditions that
(a) it is multiclass, with more than two classes in the taxonomy, and (b) the
labeled subset tends to be tiny compared to the whole set to classify. With
these two aspects in mind, we have compared 6 different SVM approaches,
covering (a) semi-supervised and supervised learning, and (b) 3 different
settings: a native multiclass one and 2 binary ones, one-against-one and
one-against-all. We have carried out this comparison through experiments over
3 different datasets.

Parts of the research in this chapter have been published in Zubiaga et al. 
(2009b) and Zubiaga et al. (2009a). 

We have also answered the following research questions in this chapter: 

Research Question 1 

What kind of SVM classifier should be used to perform this kind of classification
task: a native multiclass classifier, or a combination of binary classifiers?

We have shown the clear superiority of the native multiclass SVM classifiers
over the approaches combining binary classifiers. Our results show that
relying on a set of binary classifiers is not a good option when it comes to
multiclass taxonomies. Native multiclass classifiers, which consider all
the classes at the same time and have more knowledge of the whole task, perform
much better.

Research Question 2

What kind of learning method performs better for this kind of classification task:
a supervised one, or a semi-supervised one?

Semi-supervised approaches may perform better when the labeled subset is
really small, but supervised approaches, which are computationally less
expensive, perform similarly with more labeled documents. Therefore, we have also
shown that, unlike in the binary tasks studied by Joachims (1999), a supervised
approach performs very similarly to a semi-supervised approach in these
settings. It seems reasonable that predicting the class of uncategorized documents
is much more difficult when the number of classes increases, and so the
miscategorized documents are harmful to the classifier's learning.

Accordingly, we decided to use a supervised multiclass SVM in this thesis, i.e.,
svm-multiclass by Joachims (1998). We use the 1-step-mSVM approach in
Chapters 5, 6 and 7.

4

Generation of Social Tagging Datasets



"As a general principle, the more users share about themselves, the more others in the 
community will learn about them and identify with them."
— Matt Rhodes 

This chapter describes and analyzes in detail the social tagging
datasets we have created to use throughout this work. After
looking for existing datasets, we found none that fulfilled our
requirements. Hence, we introduce the process we followed for
generating suitable datasets, and we analyze their main
characteristics.

The chapter is organized as follows. First, in Section 4.1 we
describe the requirements and criteria that led us to the
selection of the appropriate social tagging systems. In
Section 4.2 we comprehensively analyze the features of the
selected social tagging systems. Next, in Section 4.3 we present
the process we carried out for gathering the datasets from the
Web. Then, in Section 4.4 we analyze the folksonomies of these
datasets and present a set of statistics. In Section 4.5 we
introduce the additional data, besides tagging data, that we
retrieved and included in the datasets. Finally, in Section 4.6
we conclude and answer the following research question:

Research Question 3 

How do the settings of social tagging systems affect users'
annotations and the resulting folksonomies?

4.1 Selection of Social Tagging Systems

First of all, we defined a set of conditions that the selected social tagging systems 
should fulfill according to our requirements: 

1. They must have a large community of users involved. This enables further
analysis of the aggregation of annotations. Whether or not a community is
large is obviously subjective, though. We consider it large enough when there
is an active community and resources tend to be annotated by many users.

2. In order to gather the required data, they must provide an accessible API
or, failing that, allow access to the data through HTML scraping. The
required data include full access to the triple involved in each bookmark,
i.e., the user annotating it, the resource being annotated, and the tags. This
is extremely relevant for analyzing the nature and structure of folksonomies,
and how they are created.

3. Regarding the ground truth we will assume for the classification tasks,
the considered resources must be classified somewhere on consolidated
taxonomies by experts. These categorization data will provide a way to
quantitatively evaluate the classification tasks.

We first checked whether any existing social tagging datasets fulfilled our
requirements. Even though we looked for social tagging datasets created and
made publicly available by others, we found only a few at the time, and none
of them matched our needs 1. Therefore, we decided to create new datasets.
Before creating the datasets, though, it is of utmost importance to select the
appropriate social tagging sites to collect them from. Since we wanted to
analyze in depth the tagging structure of folksonomies, we needed to get data
as detailed as possible. However, not all the social tagging sites provide
all these data.

We analyzed a large set of social tagging sites, and studied whether or not
they matched the above requirements. We found that most of them were in the
long tail according to the size of the community, with small and almost inactive
groups of contributors 2. We ruled them out, and considered those in the head,
with large and active communities. Not all of them provide all the required
data, though. Some social tagging sites show the aggregated list of tags for each
resource, but there is no way to extract bookmark data, and thus the exhaustive
list of users who contributed and the tags each of them assigned when saving the
resource. Moreover, some social tagging sites only show a list of tags for each
resource, without the number of users annotating them 3. Also, there are sites
where the annotated resources have no consolidated category data 4. Hence, even
though there are lots of social tagging sites available online, most of them
restrict the access to data, or do not fulfill all the requirements. After
discarding those, we ended up with a shorter list of social tagging sites: (i)
Delicious 5, where users save and annotate web pages, (ii) LibraryThing 6, a
social tagging site for books, and (iii) GoodReads 7, also for books. All of
them consist of bookmarks of resources which are regularly classified by
experts: web pages have been organized into web directories since the 1990s,
and librarians have been cataloging books into categories for centuries.

1 At the time, the only dataset with categorization data for tagged resources was
CABS120k08 by Noll and Meinel (2008b), but it did not fulfill our requirements:
http://www.michael-noll.com/cabs120k08/

2 e.g., CiteULike (http://www.citeulike.org) is a bookmarking site for publications
where there is usually not enough aggregation of annotations on a resource.



4.2 Characteristics of the Selected Social Tagging Systems

Even though all the tagging systems have the same goal of enabling users to
bookmark and annotate the resources of their interest, several features make
each of them different from the rest. The design of the interface, constraints
on the entered tags, and other features could influence users' annotations. Thus,
it is worthwhile to study the nature of each of the social tagging sites we rely
on, in order to understand their underlying folksonomies.

Delicious is a social bookmarking site that allows users to save and tag their
favorite web pages, in order to ease subsequent navigation and retrieval on
large collections of annotated bookmarks. Being a social bookmarking site, any
web page can be saved, so the range of covered topics can become as wide
as the Web itself. It is known, though, that the site is biased towards computer
and design related topics. Tagging web pages is one of the main features of the
site, and it is the first thing the system asks for when a user saves a URL as a
bookmark. The system suggests tags used earlier for that URL if other users had
annotated it before. Thus, new annotators can easily select tags used by earlier
users without typing them. This could encourage users to reuse others' tags,
reducing the number of new tags assigned to a resource.

3 e.g., GiveALink (http://www.givealink.org) only shows an unweighted list of popular
tags for each bookmarked web page.

4 e.g., Last.fm (http://www.last.fm) provides large amounts of annotations for musical
groups, but there is no standard taxonomy organizing them by musical genres.

5 http://delicious.com

6 http://www.librarything.com

7 http://www.goodreads.com

LibraryThing and GoodReads are social cataloging 8 sites where users save
and annotate books. Commonly, users annotate the books they own, have
read, or are planning to read. We believe that users contributing to this
kind of site are more knowledgeable about the resources than those contributing
to social bookmarking systems. Moreover, there are also writers and libraries
contributing as users, who have a deep background in the field. This could yield
annotations providing further and more detailed knowledge. The main difference
between these two systems is that LibraryThing does not suggest tags when saving
a book, whereas GoodReads lets users select from tags within their personomy,
that is, tags they previously assigned to other books. The latter makes it easier
to reuse users' favorite tags, without re-typing them. This could encourage users
to keep a smaller tag vocabulary, in which they barely use tags they did not
use previously. Moreover, LibraryThing brings the user to a new page when
saving a book, where they can attach tags to it; GoodReads, though, requires the
user to click again on the saved book to open the form to add tags. Another
remarkable difference is that LibraryThing allows some users to group tags with
the same meaning, thus linking typos, misspellings, synonyms and translations to
a single tag, e.g., science-fiction, sf and ciencia ficcion are grouped
into science fiction.

Despite the aforementioned differences, all of them have some characteristics
in common: users save resources as bookmarks, a bookmark can be annotated
with a variable number of tags ranging from zero to unlimited, and the vocabulary
of the tags is open and unrestricted. Table 4.1 summarizes the main features of
the three social tagging sites we study in this thesis.





                        Delicious             LibraryThing          GoodReads
Resources               web documents         books                 books
Tag suggestions         based on earlier      no                    based on user's
                        bookmarks on the                            personomy
                        resource
Users                   general               readers, writers &    readers, writers &
                                              libraries             libraries
Tag grouping            no                    selected users        no
                                              suggest merging tags
Vocabulary              open                  open                  open
Tag insertion           space-separated       comma-separated       one by one text-box
When saving a resource  prompts user to add   prompts user to add   user needs to click
                        tags                  tags at second step   again to add tags

Table 4.1: Characteristics of the studied social tagging systems.



8 Both social bookmarking and social cataloging refer to social tagging systems. The sole
difference lies in the resources, i.e., URLs are bookmarked, whereas books are cataloged.

4.3 Generation Process of Datasets

Even though the three chosen social tagging sites provide public access to the
full bookmarking activity, getting large collections of data from them is a
complicated task. All of them have an API for accessing the data, but none of the
APIs provides the required data, so crawling the sites and scraping the HTML
code instead seems to be the only way to achieve the goal. Moreover, each site
sets its own limit on the number of requests, and many requests must be made in
order to obtain large-scale datasets. Hence, we set a crawling policy for each
site, and applied it with extra care in order not to get banned while getting as
much data as possible.

4.3.1 Getting Popular Resources 

As a starting point, we focused on getting a set of popular resources from each
site. This provided an initial list of popular resources that represented a good
seed from which to start the gathering process. Those resources were also more
likely to have been categorized by experts than resources in the tail with fewer
annotations. We could also have started the process by looking for popular tags
or active users, but starting from resources seems reasonable given that those
are what we aim to classify. Next, we focus on the process of gathering the data
in such a way that those resources are well represented as far as the involved
users and their annotations are concerned. Apart from representing those
resources, we were also interested in gathering additional data, in order to
represent the involved users and tags to a great extent.

First of all, we queried the three sites for popular resources. We consider
a resource to be popular if at least 100 users have bookmarked it 9. In the case
of Delicious, we found a set of 87,096 unique URLs fulfilling this requirement.
As regards LibraryThing and GoodReads, we found an intersection of 65,929
popular books. Since the latter two rely on the same resources, we created parallel
datasets for them, where the same books have categorization data attached.
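
The popularity criterion can be expressed as a short sketch over bookmark triples. The data layout (an iterable of (user, resource, tags) tuples) and the function name are assumptions made for illustration; they do not describe the actual crawler.

```python
from collections import defaultdict


def popular_resources(bookmarks, min_users=100):
    """Return the resources bookmarked by at least `min_users` distinct users.

    `bookmarks` is an iterable of (user, resource, tags) triples.
    """
    users_per_resource = defaultdict(set)
    for user, resource, _tags in bookmarks:
        users_per_resource[resource].add(user)
    return {
        resource
        for resource, users in users_per_resource.items()
        if len(users) >= min_users
    }
```

Under the same assumptions, the set of books popular on both cataloging sites would be the intersection of the two results, e.g. `popular_resources(librarything) & popular_resources(goodreads)`, provided that both sites identify books with a shared key.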

4.3.2 Looking for Classification Data 

In the next step, we looked for classification labels assigned by experts for both
kinds of resources. For the URLs gathered from Delicious, we used the Open
Directory Project 10 (ODP) as a classification scheme. ODP is an open web
directory, constructed and maintained by a community of volunteer editors, and it
includes categorization data on a hierarchical structure for more than 4 million
URLs. Matching the popular URLs on Delicious against those in the ODP returned a
set of 12,616 URLs with a category assigned. For the set of books, we fetched
their classification for both the Dewey Decimal Classification (DDC) and the
Library of Congress Classification (LCC) systems. The former is a classical
taxonomy that is still widely used in libraries, whereas the latter is used by
most research and academic libraries. We found that 27,299 books were categorized
on DDC, and 24,861 books had an LCC category assigned. In total, there are 38,148
books with category data from either one or both classification schemes.

9 It was shown that the tag set of a resource tends to converge when 100 users contribute
to it (Golder and Huberman, 2006). Thereby we consider it as a threshold for a resource
to be popular.

10 http://www.dmoz.org

In this thesis, we focus on both the top level and the second level of the
taxonomies. This enables us to evaluate the usefulness of social tags for
classification on both broader and narrower categories. Even though the taxonomies
comprise more than 2 levels of categorization, deeper levels would leave too few
resources in each category to enable appropriate experimentation. Table 4.2
summarizes the number of classes in each taxonomy and level, as well as the number
of resources with categorization data for each of them. We kept the structure of
all the taxonomies as they were, with one small change for LCC: we merged the E
(History of America) and F (History of the United States and British, Dutch,
French, and Latin America) categories into a single one, as it is not clear that
they are disjoint. Also, note that the number of resources is slightly smaller for
second levels. This is because we removed second-level categories and their
underlying resources when there were fewer than 5 resources in them, since such
categories are poorly represented 11.
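
The pruning of sparsely populated second-level categories can be sketched as
follows. This is a minimal illustration, assuming a hypothetical mapping from
resource identifiers to second-level category labels; the identifiers and labels
in the toy data are made up.

```python
from collections import Counter


def prune_small_categories(labels, min_resources=5):
    """Drop resources whose category has fewer than `min_resources` members.

    `labels` maps a resource identifier to its second-level category
    (a hypothetical layout chosen for this sketch).
    """
    sizes = Counter(labels.values())
    return {res: cat for res, cat in labels.items() if sizes[cat] >= min_resources}


# Six books in one category, two in another; only the first category survives.
labels = {f"book{i}": "813" for i in range(6)}
labels.update({"book6": "099", "book7": "099"})
kept = prune_small_categories(labels)
```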





              Top level                Second level
         Resources   Classes      Resources   Classes
ODP         12,616        17         12,286       243
DDC         27,299        10         27,040        99
LCC         24,861        20         23,565       204

Table 4.2: Number of resources and classes for the classification experiments.

4.3.3 Gathering Tagging Data

Finally, we queried (a) Delicious for gathering all the personomies involved in
the set of categorized URLs, and (b) LibraryThing and GoodReads for gathering
all the personomies involved in the set of categorized books. By personomy we
mean the whole list of bookmarks posted by a user, including an identifier of
the resources and the tags attached to them. None of the three sites restricts
the bookmarks shown in personomies, so they return all available public
bookmarks for the queried users.

11 The threshold of 5 resources is arbitrary. It is reasonable from our point of view,
because it increases the likelihood of having more than one learning instance for each
category, and the reduction of the dataset is minimal.

The process above results in a large collection of bookmarks for each dataset. 
Within the gathered data, we focus on the following information for each book- 
mark: 

• User (U): an identifier of the user who annotated the resource. 

• Resource (R): the resource annotated by the bookmark. It is a URL in the 
case of Delicious, and the ISBN identifier of a book in the case of Library- 
Thing and GoodReads. 

• Tags (T): the set of tags, in case it is available, annotated by the user to the 
resource. 

That is, each bookmark is a triple drawn from U × R × T. In this process,
we consider all the tags attached to each bookmark, except on GoodReads, where
a tag is automatically attached to each bookmark depending on the reading state
of the book: read, currently-reading or to-read. We do not consider this to be
part of the tagging process, but an automated step that does not provide useful
information for classification, so we removed all occurrences of these tags from
our dataset.
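
A bookmark as described above can be modelled as a small record, and the removal
of the automatic GoodReads reading-state tags as a set difference. The sketch
below is illustrative only: the `Bookmark` type, the helper name and the
placeholder identifiers are assumptions, while the three reading-state tags are
the ones listed in the text.

```python
from typing import FrozenSet, NamedTuple

# Automatic reading-state tags on GoodReads, as listed in the text.
GOODREADS_STATUS_TAGS = frozenset({"read", "currently-reading", "to-read"})


class Bookmark(NamedTuple):
    """One (user, resource, tags) triple; a hypothetical record layout."""
    user: str
    resource: str          # a URL for Delicious, an ISBN for the book sites
    tags: FrozenSet[str]   # may be empty, since tagging is optional


def clean_goodreads(bookmark):
    """Return a copy of the bookmark without the automatic reading-state tags."""
    return bookmark._replace(tags=bookmark.tags - GOODREADS_STATUS_TAGS)


# Placeholder identifiers, not taken from the dataset.
raw = Bookmark("user42", "isbn-0001", frozenset({"to-read", "classics"}))
cleaned = clean_goodreads(raw)
```

A bookmark whose only tags were reading-state tags ends up with an empty tag set,
and would then be treated like any other untagged bookmark.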

4.4 Statistics and Analysis of the Datasets 

In order to understand the nature and characteristics of each dataset, and to 
analyze how the settings of each social tagging system affect the folksonomies, 
we study and present statistics of the created datasets. 

It is worth noting that, as we stated above, attaching tags to a bookmark
is an optional step, so that, depending on the social tagging site, a number of
bookmarks may remain without tags. Table 4.3 on the next page presents the
number of users, bookmarks and resources we gathered for each of the datasets,
as well as the percentage with attached annotations. In this work, as we rely on
tagging data, we only consider annotated data, ruling out bookmarks without tags.
Thus, from now on, all the results and statistics presented are based on annotated
bookmarks. From these statistics, it stands out that most users (above 87%)
provide tags for bookmarks on Delicious, whereas far fewer users assign tags to
resources on LibraryThing and GoodReads (roughly 38% and 17%, respectively).
This suggests the importance of Delicious' encouragement to add tags, and of
GoodReads' discouragement, which requires the user to click twice on the book in
order to add tags. The latter makes the tagging process cumbersome, and yields a
large number of untagged bookmarks. LibraryThing lies halfway between the two: it
automatically takes the user to the tagging form, but as a skippable second step
after saving the book.



Delicious
               Annotated          Total     Ratio
Users          1,618,635      1,855,792    87.22%
Bookmarks    273,478,137    300,571,231    91.00%
Resources     92,432,071    102,828,761    89.89%
Tags          11,541,977

LibraryThing
               Annotated          Total     Ratio
Users            153,606        400,336    38.37%
Bookmarks     22,343,427     44,612,784    50.08%
Resources      3,776,320      5,002,790    75.48%
Tags           2,140,734

GoodReads
               Annotated          Total     Ratio
Users            110,344        649,689    16.98%
Bookmarks      9,323,539     47,302,861    19.71%
Resources      1,101,067      1,890,443    58.24%
Tags             179,429

Table 4.3: Statistics on availability of tags in users, bookmarks, and resources for
the three datasets.

The crawling process enabled us to gather large amounts of bookmarks. Not
all of them correspond to resources with categorization data from experts,
though: when gathering personomies, we also gathered many bookmarks for
resources without categorization data. Table 4.4 on the facing page shows how
resources and bookmarks split into the categorized and uncategorized subsets,
according to the categorization data we gathered from expert-driven taxonomies.
The number of categorized bookmarks or resources is always much lower than the
number of uncategorized ones. This enables us to analyze a larger folksonomy as a
whole to find tagging patterns on each site, and to experiment afterward on the
categorized subset.

A first impression of the vocabulary employed in each folksonomy can be
obtained by looking at the top tags on each site. The 10 tags most used by users in
each of the datasets are listed in Table 4.5 on the next page. On one hand, top tags
on Delicious include tags like design, software and blog, showing its bias towards
computing and design. On the other hand, top tags on LibraryThing and
GoodReads share some similarities, with tags related to literary genres standing
out. Moreover, on the latter both non-fiction and nonfiction are among
the most popular tags, whereas they appear grouped on the former.
Resources
                               Top level                         Second level
                       Categ.     Uncateg.    Ratio       Categ.     Uncateg.    Ratio
Delicious (ODP)        12,616   92,419,455   0.014%       12,286   92,419,785   0.013%
LibraryThing (DDC)     23,617    3,752,703   0.629%       22,409    3,753,911   0.597%
LibraryThing (LCC)     24,861    3,751,459   0.636%       23,566    3,752,754   0.628%
GoodReads (DDC)        23,617    1,077,450   2.192%       22,409    1,078,658   2.077%
GoodReads (LCC)        24,861    1,076,206   2.310%       23,566    1,077,501   2.187%

Bookmarks
                               Top level                         Second level
                       Categ.     Uncateg.    Ratio       Categ.     Uncateg.    Ratio
Delicious (ODP)    10,984,426  262,493,711   4.185%   10,773,505  262,704,632   4.101%
LibraryThing (DDC)  4,266,445   18,076,982  23.602%    4,238,774   18,104,653  23.413%
LibraryThing (LCC)  3,777,353   18,566,074  20.345%    3,607,935   18,735,492  19.257%
GoodReads (DDC)     1,615,235    7,708,304  20.954%    1,611,833    7,711,706  20.901%
GoodReads (LCC)     1,465,740    7,857,799  18.653%    1,432,073    7,891,466  18.147%

Table 4.4: Ratio of resources and bookmarks belonging to categorized or
uncategorized data. The ratio value represents the percent of categorized bookmarks as
compared to the uncategorized ones.
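
As a quick check of how the ratio column of Table 4.4 can be read, the sketch
below divides the categorized count by the uncategorized count (not by the total)
and expresses the result as a percentage. The function name is a hypothetical
helper; the counts are the Delicious (ODP) top-level figures from the table.

```python
def categorized_ratio(categorized, uncategorized):
    """Percent of categorized items relative to the uncategorized ones."""
    return 100.0 * categorized / uncategorized


# Delicious (ODP), top level, counts taken from Table 4.4.
delicious_resources = categorized_ratio(12_616, 92_419_455)
delicious_bookmarks = categorized_ratio(10_984_426, 262_493_711)
```

Rounded to three decimals, these two values agree with the 0.014% and 4.185%
reported for Delicious in the table.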

Delicious       LibraryThing       GoodReads
design          fiction            fiction
blog            non-fiction        fantasy
tools           fantasy            non-fiction
software        history            own
webdesign       mystery            young-adult
web             science fiction    classics
reference       read               mystery
programming     biography          romance
music           poetry             wishlist
web2.0          novel              nonfiction

Table 4.5: Top 10 most popular tags on the datasets.

Regarding the distribution of tags across all the resources, users and
bookmarks in the datasets, there is a clear difference in behavior among the three
collections. Figure 4.1 on page 69 shows, on a logarithmic scale, the percentage of
resources, users and bookmarks on which tags are annotated according to their
rank on the system. That is, the X axis refers to the percentage of the tag rank,
whereas the Y axis represents the percentage of appearances in resources, users and
bookmarks. For instance, if the tag ranked first had been annotated on half of
the resources, the value for the top-ranked tag on resources would be 50%. Thus,
these graphs enable us to analyze how popular the tags in the top are as compared
to the tags in the tail on each site. Figure 4.2 on page 70 shows the average usage
of tags in a given rank on resources for each dataset. That is, we give a value of 1
to the tag annotated the most on a resource, hence ranked first for that resource.
The second tag is given a value according to the fraction of users annotating it
as compared to the first one, and so on for the tags ranked third, fourth, etc. on
resources. Finally, we compute the average over the tags ranked at each position,
which is shown in the graph. It helps infer the popularity gap between top tags on
resources and tags ranked lower. Looking at those two figures, and combining
their meanings, it stands out that GoodReads has the highest usage of tags in the
tail, whereas Delicious presents the highest usage of tags in the top. Delicious is
the site with the highest diversity of tags, where a few tags become really popular
(both in the whole collection and on resources), and many tags are seldom used. We
believe that the reasons for these differences in tag distributions are:

• Since Delicious suggests tags that previous users have assigned to a
resource, tags in the top are likely to occur even more frequently, whereas
others may barely be used.

• LibraryThing and GoodReads do not suggest tags used by earlier users
and, therefore, tags other than those in the top tend to be used more
frequently than on Delicious.

• GoodReads suggests tags from previous bookmarks of the same user,
instead of tags that others assigned to the resource being tagged. This
encourages users to reuse tags from their personomy, which therefore keeps
a smaller number of tags (see Table 4.6). In addition, users tend to assign
fewer tags to a bookmark on average, probably due to the one-by-one tag
insertion method of the site's interface.
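
The per-rank average usage plotted in Figure 4.2 can be sketched as follows. This
is a minimal illustration of the computation as described in the text, not the
code used for the figure: the input layout (a tag counter per resource) is
hypothetical, and averaging each rank only over the resources that have a tag at
that rank is our assumption.

```python
from collections import Counter, defaultdict


def average_usage_by_rank(resource_tag_counts):
    """Average, per rank position, of tag usage relative to each resource's top tag.

    `resource_tag_counts` maps a resource to a Counter {tag: number of users}.
    """
    totals = defaultdict(float)
    seen = defaultdict(int)
    for counts in resource_tag_counts.values():
        ranked = [n for _, n in counts.most_common()]
        if not ranked:
            continue  # resources without tags contribute nothing
        top = ranked[0]
        for position, n in enumerate(ranked, start=1):
            totals[position] += n / top  # the top tag always contributes 1
            seen[position] += 1
    return {pos: totals[pos] / seen[pos] for pos in sorted(totals)}


# Two toy resources with made-up user counts per tag.
toy = {
    "r1": Counter({"photo": 10, "flickr": 5}),
    "r2": Counter({"design": 4, "web": 3, "css": 1}),
}
usage = average_usage_by_rank(toy)
```

For the toy data, rank 2 averages 5/10 and 3/4, and rank 3 is only defined for
the second resource.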



# of tags       Delicious    LibraryThing    GoodReads
Per resource        33.35           14.53        13.33
Per user          632.714          357.15       131.03
Per bookmark         3.75            2.46         1.55

Table 4.6: Average counts of different tags.

Figure 4.1: Tag usage percentages in the collection. These 3 graphs represent, on
a logarithmic scale for both x and y axes, the percent of annotations to resources,
users, and bookmarks per tag rank.

Regarding the distribution of tags across resources, users, and bookmarks,
Figure 4.3 on page 71 shows the percentages of tags appearing more, equally or less
frequently in one kind of item (i.e., resources, users or bookmarks) than in
another. By definition, a tag cannot appear in fewer bookmarks than users or
resources. Looking at the rest of the data, it stands out that tags tend
to appear in more bookmarks than users (b > u) and more resources than users
(r > u) for GoodReads, due to the same feature that allows users to select among
tags in their personomy. However, LibraryThing and Delicious have many tags
present in the same number of bookmarks and users (b = u), and resources and
users (r = u), even though the difference is more marked for the former site.
This reflects the large number of tags that users utilize just once on these sites.
All three sites have two features in common: there are a few exceptional tags
utilized by more users than the number of resources they appear in (r < u), and
almost all the tags are present in the same number of bookmarks and resources
(b = r). The latter, combined with the lower (b = u) values, means there is a
large number of users spreading personal tags across resources that only have
one bookmark with that tag, especially on GoodReads, but also on the other two
sites.
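
The comparison of r, u and b per tag can be illustrated with a short sketch. It
assumes a hypothetical list of (user, resource, tags) bookmarks with at most one
bookmark per user and resource; the function names and the toy users and books
are made up for the example.

```python
from collections import defaultdict


def tag_spread(bookmarks):
    """Per tag, count distinct resources (r), distinct users (u) and bookmarks (b)."""
    resources = defaultdict(set)
    users = defaultdict(set)
    count = defaultdict(int)
    for user, resource, tags in bookmarks:
        for tag in set(tags):  # a tag counts once per bookmark
            resources[tag].add(resource)
            users[tag].add(user)
            count[tag] += 1
    return {t: (len(resources[t]), len(users[t]), count[t]) for t in count}


def relation(x, y):
    """Return '>', '<' or '=' depending on how x compares to y."""
    return ">" if x > y else "<" if x < y else "="


toy = [
    ("alice", "book1", ["fantasy", "own"]),
    ("alice", "book2", ["own"]),
    ("bob", "book1", ["fantasy"]),
]
spread = tag_spread(toy)
r, u, b = spread["own"]
```

In the toy data, the personal tag own is spread by a single user over two
resources (r > u, b > u, b = r), the pattern the text attributes to tags reused
from a personomy.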

Figure 4.2: Tag usage percentages on resources. Each tag rank represents the
average usage of tags appearing in that position on resources as compared to the
top ranked tag.

Finally, we analyze to what extent a bookmark introduces new tags into a
resource that were not present in earlier bookmarks. Figure 4.4 on page 72 shows
these statistics for Delicious and LibraryThing. The same graph for GoodReads
is not shown because neither the timestamp nor the ordering of the bookmarks
is available in our dataset. The graph shows, on average, the ratio of new tags,
not present in earlier bookmarks of a resource, assigned in bookmarks that rank
from first to 100th. For example, if tag1 and tag2 were annotated in the first
bookmark of a resource, and tag2 and tag3 in the second bookmark for the same
resource, the novelty ratio for the second bookmark is 50%. Tag novelty is
markedly lower on Delicious than on LibraryThing. This is,
again, due to the tag suggestion policy of Delicious, which brings about a higher
likelihood of reusing previously existing tags.
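
The novelty ratio can be sketched in a few lines. The function below is an
illustrative helper (its name and input layout are assumptions): it takes the tag
lists of the bookmarks of one resource in posting order, and reproduces the
tag1/tag2/tag3 example given in the text.

```python
def novelty_ratios(bookmark_tags):
    """For each bookmark of a resource, in posting order, return the fraction of
    its tags that no earlier bookmark of the same resource had used."""
    seen = set()
    ratios = []
    for tags in bookmark_tags:
        tags = set(tags)
        if tags:
            ratios.append(len(tags - seen) / len(tags))
        else:
            ratios.append(0.0)  # assumption: an untagged bookmark adds no novelty
        seen |= tags
    return ratios


# The example from the text: the second bookmark repeats tag2 and adds tag3.
ratios = novelty_ratios([["tag1", "tag2"], ["tag2", "tag3"]])
```

Averaging the n-th value over all resources would give one point of the curve in
Figure 4.4 for bookmark rank n.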

4.5 Gathering Additional Data 

Besides all the aforementioned tagging data, we also gathered some more data
about the categorized resources. We needed other data sources in order to
perform comparisons with tagging data throughout the experimentation. Specifically,
we compare the usefulness of tagging data against other sources for the
classification task in Chapter 5 on page 75, and also require additional data to
analyze the descriptiveness of tags in Chapter 7 on page 111.

Figure 4.3: Tag distribution across resources (r), users (u) and bookmarks (b).
Each bar represents the percent of tags that match the condition on X axis.

On one hand, we got the following data for the categorized URLs:

• Self-content: it is the content of the web page itself, i.e., the HTML code
fetched from the original URL.

• Notes: a note can be defined as a free text describing the content of a web
page. It is available on Delicious, and it is intended to provide a means to
briefly summarize the aboutness of a web page.

• Reviews: a review may be considered fairly similar to a note. However,
reviews, as collected from StumbleUpon 12, usually have a subjective bias,
since users tend to express how much they like the content of a web page.

12 http://www.stumbleupon.com

Figure 4.4: Novelty ratio of tags per rank of bookmark.

On the other hand, with regard to the categorized books, there is no easy way
to get the content of the book. We did not have access to the books, since most of
them are not freely available. Thus, we got the following metadata associated with
the books:

• Synopses: a synopsis is a brief summary of the content of a book, which
is usually printed on the back cover. We fetched synopses from the book
retailer Barnes&Noble 13.

• Editorial reviews: summaries written by the publisher, or other
professionals, are considered editorial reviews. We gathered them from Amazon 14.

• User reviews: we also collected reviews written by users on LibraryThing,
GoodReads and Amazon, where they comment on the books with their
summaries and thoughts.

Since we do not have access to the self-content of the books, we will consider
both synopses and editorial reviews as a summary of their contents.

13 http://www.barnesandnoble.com
14 http://www.amazon.com

4.6 Conclusion

We have studied the characteristics of several social tagging systems, and
settled on three sites that fulfill our requirements: Delicious, LibraryThing and
GoodReads. We have created three large-scale social tagging datasets from these
sites including millions of bookmarks, not only for web pages, but also for books,
which enables further analysis of other kinds of resources. To the best of our
knowledge, these are the largest social tagging datasets used for research so far.
Also, we have analyzed the statistics of the datasets and the features of the
underlying folksonomies.

Even after we created these social tagging datasets, and made parts of them
publicly available 15, little work has been done on creating more datasets and
especially on releasing them. Körner and Strohmaier (2010) present
a list of publicly available social tagging datasets, which includes ours.
However, the authors point out the problem of the unavailability of
more datasets, and encourage researchers to create and release new ones.

In this chapter, we have answered the following research question: 

Research Question 3 

How do the settings of social tagging systems affect users' annotations and the 
resulting folksonomies? 

To this end, we have analyzed several features that can be found in different
settings of social tagging systems. Among the analyzed features, we have shown
the impact of tag suggestions, which considerably alter the resulting folksonomy.
The three studied social tagging sites all differ in their settings regarding
suggestions:

• Resource-based suggestions (Delicious): when the system suggests tags
assigned by other users to the resource at the time of bookmarking it, the
likelihood of using new tags to further describe such a resource decreases.
In this case, users provide less originality and tend to rely on system
suggestions.

• Personomy-based suggestions (GoodReads): when the system suggests
tags previously used by the user, the vocabulary in their personomy tends
to be much smaller. However, users do not know how others annotated a
resource, and thus they are likely to provide new tags to the resource.

• Without suggestions (LibraryThing): when the system does not suggest
any tags to the user, the vocabulary in their personomy increases, as well
as the diversity of tags in each resource.

From now on, in Chapter 5 on page 75, Chapter 6 on page 95 and Chapter 7 
on page 111, we will use these three datasets for experimentation, and we will 
analyze in more depth their features and how they affect the performance of a 
resource classification task. 



15 http://nlp.uned.es/social-tagging/datasets/

"If we wish to discuss knowledge in the most highly developed contemporary society, we
must answer the preliminary question of what methodological representation to apply to
that society."

— Jean-Francois Lyotard 

In this chapter, we set out to propose and evaluate different rep- 
resentations of resources based on social tags for a resource clas- 
sification task. Each user contributing to the annotations on a 
resource provides their own tags, which commonly differ from 
others'. We explore different ways of representing large amounts 
of annotations provided by users and aggregated on resources 
on social tagging systems. We also measure the potential of 
social tags as compared to other data sources including self- 
content and user reviews, and analyze the suitability of com- 
bining them in search of a better performance of the classifier. 

This chapter is organized as follows. In Section 5.1 on the
following page, we describe how user annotations are aggregated
on a resource, as an introduction to the problem. In Section 5.2 on
page 77 we propose several representation approaches for social 
tags. Then, we present the results of the tag-based classification 
in Section 5.3 on page 79, and compare them to the results by 
other data sources in Section 5.4 on page 84. We describe the ex- 
periments on combining data sources in Section 5.5 on page 86, 
and conclude the chapter in Section 5.6 on page 90. 



The following research questions are addressed in this chapter: 
Research Question 4

What is the best way of amalgamating users' aggregated anno- 
tations on a resource in order to get a single representation for a 
resource classification task? 

Research Question 5 

Despite the usefulness of social tags for these tasks, is it worthwhile
considering their combination with other data sources like
the content of the resource as an approach to improve the results
even more?

Research Question 6 

Are social tags also useful and specific enough to classify re- 
sources into narrower categories as in deeper levels of hierarchi- 
cal taxonomies? 

5.1 Aggregation of User Annotations 

Social tagging systems allow users to annotate resources that others have
previously annotated. This enables the aggregation of annotations provided by many
users on the same resource. Naturally, each user provides their own annotations,
so tags tend to differ from user to user. These annotations are listed
all together in a detailed manner (see Table 5.1), and merged into a single list of
top tags which summarizes the Full Tagging Activity (FTA in the following) on
a resource (see Table 5.2 on the next page).



User annotations: Flickr.com
User 1:    photo, photography, images, pictures
User 2:    photo, web2.0, social, tools, blog
User 3:    cloud, pictures, sharing
User 4:    flickr, photos
User 5:    photo, sharing, tool

Table 5.1: Example of annotations for the URL Flickr.com on the social
bookmarking site Delicious.

The tagging activity of a community of users on a resource creates an
aggregated list of tags. A resource annotated by $p$ users will have a list of $n$
different tags, where each tag could have been assigned by $p$ or fewer users. The
number of users who used a certain tag, $w_t$, defines a weight that allows us to
infer an ordered list of tags.
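
The aggregation can be sketched with the five example users of Table 5.1. The
code below is a minimal illustration, with a hypothetical `aggregate` helper: it
counts, for each tag, the number of users who assigned it (the weight $w_t$) and
returns it together with the number of annotating users $p$.

```python
from collections import Counter

# The example annotations of Table 5.1.
annotations = {
    "User 1": ["photo", "photography", "images", "pictures"],
    "User 2": ["photo", "web2.0", "social", "tools", "blog"],
    "User 3": ["cloud", "pictures", "sharing"],
    "User 4": ["flickr", "photos"],
    "User 5": ["photo", "sharing", "tool"],
}


def aggregate(annotations):
    """Return (p, weights): the number of annotating users and, for each tag,
    the number of users who assigned it."""
    weights = Counter()
    for tags in annotations.values():
        weights.update(set(tags))  # a user counts at most once per tag
    return len(annotations), weights


p, weights = aggregate(annotations)
top_tags = weights.most_common(3)  # ordered list of the best-weighted tags
```

With these five users, photo is the top tag with a weight of 3, followed by
pictures and sharing with a weight of 2 each.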






Top tags: Flickr.com (79,681 users)
photos          22,712
flickr          19,046
photography     15,968
photo           15,225
sharing         10,648
images           9,637
web2.0           9,528
community        4,571
social           3,798
pictures         3,115

Table 5.2: Example of top tags for the URL Flickr.com on the social bookmarking
site Delicious: the number associated to each tag represents the number of users
annotating it.

This aggregation of social tags was suggested as a means to feed the
classification of resources (Noll and Meinel, 2008a). However, to the best of our
knowledge, no research work has been conducted on their application to an
automated classifier. Moreover, it is not clear what a good way to represent the
aggregation of tags is. In this chapter, we focus on these issues by proposing,
analyzing and evaluating different representations of social tags for
classifying resources, and also by comparing their performance to other data sources
including self-content and user reviews. We perform this study on two
different levels of the taxonomies, thereby exploring the suitability of social tags
for broader and narrower categories.

5.2 Representing Resources Using Tags 

We believe there are two major factors that should be considered for the repre- 
sentation of resources using social tags provided by users: (1) the selection of the 
tags that should be taken into account for the representation, and (2) the weights 
that should be assigned to those tags. 

On the one hand, as regards the selection of tags, one could think that not
all the tags are useful for the representation, but just those in the top that most
users have chosen. An important feature of social tagging systems is the ability of
users to coincide on some tags provided by others. Thus, the coincidence of user
annotations, which is reflected in the top tags, can be considered a consensus
on the main tags that best fit the description of the resource. However, the
diversity of the annotations also gives users the opportunity to assign seldom-used
tags that further detail the resource. The latter could encourage considering
even tags in the tail. We will thus explore both the tags in the top and the whole set.

On the other hand, the weight of each tag must be defined appropriately. We
propose 4 different ways of assigning those weights:

• Tag ranks: the weight is assigned according to the position of a tag in the
ranked top of tags. Tags in the top 10 list of a resource are
assigned a value in a rank-based way. The first-ranked tag is always assigned
the value 1, the second 0.9, the third 0.8, and so on. This approach
respects the position of each tag in the top 10, but ignores the different gaps
among tag weights.

• Tag fractions: the weight is computed according to the fraction of users
who annotate a tag, $w_t / p$, i.e., the number of users annotating a tag on a
resource, divided by the total number of users who annotated the resource.
Taking into account both the number of users $p$ who bookmarked a resource
and the weight $w_t$ of each tag, it is possible to define the fraction of users
assigning each tag. A tag annotated by all the users of a resource has a weight
that matches the user count of the resource, and thus a fraction
of 1. Accordingly, the value of each tag is higher
than 0 (since the considered tags have been annotated by at least one user), and
can be up to 1. This representation approach is similar to that by Noll and
Meinel (2008a) for their analysis of the similarity between social tags and
the classification by experts. However, they ignore the least popular tags
by giving them a value of 0, which may result in the removal of several tags
from the representation.

• Unweighted: in a binary way, the presence of a tag represents a value of 1,
and its absence a value of 0. The only feature considered for this
representation is the occurrence or non-occurrence of a tag in the annotations of a
resource. This approach thereby ignores the weights of tags, and assigns a
binary value to each feature in the vector.

• Weighted according to user counts: it considers the number of users
assigning the tag ($w_t$) as a weight. The weight of each of the tags of a
resource ($w_1, \ldots, w_n$) is used as is in this approach. Now, by
definition, the weights of the tags are fully respected, although the number
of users bookmarking a resource is ignored. Note that different orders of
magnitude are now mixed, since the count of bookmarking users ranges
over very different values. For instance, Ramage et al. (2009) used this
approach in their work on clustering web pages, but they adopted it
without comparing it to other representations.

Table 5.3 on the facing page shows an example of annotations on a resource,
and what each of the 4 weighting measures looks like for the example.
              FTA: t1 to tn (Top 10: t1 to t10)
              t1     t2     t3     ...    t9     t10    t11    ...    tn
Ranks         1      0.9    0.8    ...    0.2    0.1
Fractions     0.5    0.3    0.2    ...    0.02   0.01   0.01   ...    0.01
Unweighted    1      1      1      ...    1      1      1      ...    1
Weighted      50     30     20     ...    2      1      1      ...    1

Table 5.3: Example of the 4 representations of social tags on a resource annotated
by 100 users, where the tags ranked 1st, 2nd and 3rd were annotated by 50, 30 and 20
users, respectively.
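
The four weighting measures can be sketched as small functions over the
aggregated tag weights of a resource. This is an illustrative sketch, not the
implementation used in the experiments: the function names are hypothetical, and
the user counts of t4 to t8 are made-up fillers, since Table 5.3 only gives the
values of t1, t2, t3, t9, t10 and t11.

```python
def top_n(weights, n=10):
    """Keep only the n best-weighted tags (ties keep their input order)."""
    ranked = sorted(weights, key=weights.get, reverse=True)[:n]
    return {tag: weights[tag] for tag in ranked}


def tag_ranks(weights, top=10):
    """Rank-based weights: 1.0 for the first tag, then 0.9, 0.8, ... down to 0.1."""
    return {tag: (top - i) / top for i, tag in enumerate(top_n(weights, top))}


def tag_fractions(weights, p):
    """Fraction of the p annotating users who assigned each tag (w_t / p)."""
    return {tag: w / p for tag, w in weights.items()}


def unweighted(weights):
    """Binary representation: 1 for every tag that occurs on the resource."""
    return {tag: 1 for tag in weights}


def weighted(weights):
    """User counts taken as they are."""
    return dict(weights)


# Resource annotated by p = 100 users; t4 to t8 are made-up filler counts.
p = 100
weights = {"t1": 50, "t2": 30, "t3": 20, "t4": 15, "t5": 12, "t6": 9,
           "t7": 6, "t8": 4, "t9": 2, "t10": 1, "t11": 1}

ranks = tag_ranks(weights)
fractions = tag_fractions(weights, p)
fractions_top10 = tag_fractions(top_n(weights), p)
```

Applying `top_n` before a weighting function gives the top 10 variant of that
representation, while passing the full dictionary gives the FTA variant.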

Taking into account the factors above, we propose and analyze the 7
representation approaches summarized in Table 5.4. All 4 weighting measures are
included, as well as two selections of tags: the FTA, including the whole set of
tags, and the top 10 tags of each resource, including the best-weighted ones 1.
In the case of the rank-based weighting, we only apply it to the top 10 tags,
because it is defined to give a weight to only 10 tags.





                 Top 10    FTA
Tag ranks          X
Tag fractions      X        X
Unweighted         X        X
Weighted           X        X

Table 5.4: Summary of tag representations.



5.3 Tag-based Classification 

In line with the experimental results in Chapter 3 on page 47, we have used a
multiclass SVM algorithm to perform the classification tasks, feeding the classifier
with social tags from the three datasets introduced in Chapter 4 on page 59. We
used training sets of different sizes for each dataset, and generated 6 different
random selections for each size. We present the accuracy results corresponding
to the average of those 6 runs. Results are split into separate tables, one
for each dataset (Delicious, LibraryThing, GoodReads). Each table
includes results for all 7 representations introduced above, and both top and
second levels of the corresponding taxonomy.

1 We selected the top 10 because it is usual to find that number of tags on social tagging
systems. However, we could have chosen another value instead, yielding comparable
conclusions. We provide additional results and information on this in Appendix A on
page 143.
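
The averaging over random selections can be sketched as follows. This is only an
outline under assumptions: `evaluate` stands for a caller-supplied function that
trains a classifier (an SVM in our experiments) on the training selection and
returns its accuracy, the use of the remaining resources as the test set is an
assumption of the sketch, and `dummy_evaluate` is a stand-in that involves no
real classifier.

```python
import random
from statistics import mean


def average_accuracy(resources, train_size, evaluate, runs=6, seed=0):
    """Average `evaluate(train, test)` over `runs` random training selections.

    `evaluate` is a caller-supplied (hypothetical) function that trains a
    classifier on `train` and returns its accuracy on `test`.
    """
    rng = random.Random(seed)  # fixed seed so that the selections are repeatable
    scores = []
    for _ in range(runs):
        shuffled = list(resources)
        rng.shuffle(shuffled)
        train, test = shuffled[:train_size], shuffled[train_size:]
        scores.append(evaluate(train, test))
    return mean(scores)


def dummy_evaluate(train, test):
    """Stand-in scorer: returns the share of resources used for training."""
    return len(train) / (len(train) + len(test))


score = average_accuracy(range(10), train_size=6, evaluate=dummy_evaluate)
```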



Delicious - ODP

Top level
                             600    1400    2200    3000    4000    5000    6000
Tag Ranks                   .462    .501    .493    .501    .498    .501    .484
Tag Fractions (Top 10)      .430    .447    .456    .467    .466    .462    .464
Tag Fractions (FTA)         .442    .463    .457    .460    .461    .461    .461
Unweighted Tags (Top 10)    .505    .510    .512    .517    .520    .522    .531
Unweighted Tags (FTA)       .530    .556    .566    .572    .569    .571    .572
Weighted Tags (Top 10)      .509    .576    .606    .625    .638    .645    .654
Weighted Tags (FTA)         .533    .600    .629    .647    .660    .669    .680

Second level
                             600    1400    2200    3000    4000    5000    6000
Tag Ranks                   .292    .332    .345    .349    .351    .349    .360
Tag Fractions (Top 10)      .262    .280    .297    .304    .315    .317    .349
Tag Fractions (FTA)         .249    .279    .294    .308    .302    .302    .336
Unweighted Tags (Top 10)    .315    .340    .354    .351    .348    .365    .361
Unweighted Tags (FTA)       .411    .480    .502    .509    .519    .509    .529
Weighted Tags (Top 10)      .342    .432    .475    .497    .517    .532    .545
Weighted Tags (FTA)         .359    .453    .498    .522    .541    .556    .568

Table 5.5: Accuracy results for tag-based web page classification.

Table 5.5 shows the results on the Delicious dataset. At first glance, it is
clear that the rank-based and fraction-based approaches perform much worse than
the rest. Among the others, the weighted approach performs better than the
unweighted one, so considering the number of users assigning each tag seems to
be the best option. Conversely, considering the total number of users
annotating a resource does not seem helpful, as shown by the underperformance
of the fraction-based approach.

On the second level, however, the unweighted approach can perform better than
the weighted one for small training sets: with FTA, it does so for training
sets of up to 2,200 resources. It seems reasonable that the weighted approach
requires more training instances to correctly represent the large diversity of
possible values, especially when the number of categories increases, as happens
on the second level. Once the size of the training set increases, the weighted
approach clearly outperforms the unweighted one.

In most cases, FTA outperforms the top 10, even though the gap is not very
large. This shows that the top tags are the most useful, but the rest may also
be helpful to a lesser extent. Accordingly, tags in the tail chosen by fewer
users provide useful data that should not be discarded. The weighted approach
on all the tags performs the best for Delicious.

Table 5.6 on the following page shows the results for the LibraryThing dataset,
for the DDC and LCC taxonomies. As with Delicious, the weighted approach
relying on the FTA performs best, for both the top and second levels, and for
both schemes.

Unlike on Delicious, the FTA-based approaches are not always better than those
based on the top 10. This happens, in particular, when using the unweighted
approach for the top level classification. Many LibraryThing users tend to use
personal tags describing whether or not they own the book (e.g., own), and its
physical location (e.g., al). Such tags are rarely assigned to a given book by
many users; rather, they are personal tags that a single user spreads across
many books. Thus, ignoring tag weights and giving all tags the same weight
overrates those personal and low-ranked tags. This may increase the likelihood
that books sharing a certain personal tag are wrongly assigned to the same
category by the classifier. This is also the main reason why the gap between
the top 10 and FTA-based representations is only slight for the weighted
approach: tags below the top 10 are not as useful as on Delicious. Fortunately,
the weighted approach underrates such personal tags, and the classifier is able
to discriminate them, profiting from some low-ranked tags to slightly improve
the performance. This improvement is larger on the second level, suggesting
that low-ranked tags provide a more detailed description, and are more helpful
for deeper classification.

Unlike on Delicious, using rank-based weights performs better than the
unweighted approach, at least on the top level. Even so, the weighted approach
performs best for this dataset as well.

Table 5.7 on page 83 shows the results for the GoodReads dataset, for the DDC
and LCC taxonomies. Again, the FTA-based weighted approach is the best one,
clearly outperforming the rest for both the top and second levels on both
schemes. Unlike on LibraryThing, though, FTA-based approaches perform better
than top 10 based ones in most cases. This suggests that GoodReads users tend
to use fewer personally biased tags, making low-ranked tags much more useful
than on LibraryThing. Despite these differences, the weighted approach is
clearly the best approach for this dataset as well.

For both LibraryThing and GoodReads, the results look very similar for both
taxonomies, DDC and LCC. Even though the results are slightly better for the
former, both yield similar conclusions when comparing the gaps between
representation approaches. This reinforces the usefulness of the weighted
approach regardless of the taxonomy being considered.

Summarizing the results for the three datasets, the FTA-based weighted approach
has proven to be the best. Even though not every low-ranked tag seems useful
for the classification task, the weighted approach is able to establish how
representative each tag is of the resource, getting the best results by using
all the tags. Thus, the aggregation of user annotations has proven crucial to
define the representativity of a tag with respect to the annotated resource.

LibraryThing - DDC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .791   .783   .778   .782   .788   .787   .797
Tag Fractions (Top 10)       .719   .717   .720   .721   .727   .721   .724
Tag Fractions (FTA)          .700   .696   .701   .702   .706   .701   .706
Unweighted Tags (Top 10)     .756   .763   .753   .766   .759   .759   .758
Unweighted Tags (FTA)        .624   .622   .628   .629   .629   .628   .624
Weighted Tags (Top 10)       .858   .861   .862   .865   .866   .866   .864
Weighted Tags (FTA)          .861   .864   .864   .867   .869   .869   .868

Second level                 3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .520   .520   .526   .530   .527   .525   .532
Tag Fractions (Top 10)       .511   .513   .511   .513   .513   .517   .521
Tag Fractions (FTA)          .465   .474   .469   .470   .470   .472   .477
Unweighted Tags (Top 10)     .507   .538   .538   .532   .543   .528   .539
Unweighted Tags (FTA)        .515   .533   .530   .533   .538   .536   .538
Weighted Tags (Top 10)       .679   .687   .696   .696   .701   .701   .704
Weighted Tags (FTA)          .690   .700   .707   .709   .715   .712   .715

LibraryThing - LCC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .783   .790   .788   .783   .789   .795   .790
Tag Fractions (Top 10)       .739   .740   .741   .743   .741   .738   .746
Tag Fractions (FTA)          .711   .715   .715   .717   .714   .712   .719
Unweighted Tags (Top 10)     .759   .772   .764   .771   .763   .770   .763
Unweighted Tags (FTA)        .654   .660   .661   .661   .658   .655   .661
Weighted Tags (Top 10)       .852   .854   .856   .858   .858   .855   .858
Weighted Tags (FTA)          .853   .857   .856   .861   .861   .857   .861

Second level                 3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .519   .511   .515   .518   .512   .511   .520
Tag Fractions (Top 10)       .414   .413   .413   .415   .417   .411   .417
Tag Fractions (FTA)          .408   .409   .408   .410   .410   .409   .410
Unweighted Tags (Top 10)     .542   .568   .564   .565   .579   .550   .576
Unweighted Tags (FTA)        .596   .612   .608   .616   .615   .606   .614
Weighted Tags (Top 10)       .687   .710   .716   .720   .721   .722   .727
Weighted Tags (FTA)          .703   .725   .729   .734   .734   .736   .739

Table 5.6: Accuracy results for tag-based book classification (LibraryThing).

GoodReads - DDC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .652   .656   .659   .654   .650   .655   .668
Tag Fractions (Top 10)       .660   .658   .662   .663   .671   .659   .664
Tag Fractions (FTA)          .654   .653   .657   .658   .665   .655   .659
Unweighted Tags (Top 10)     .647   .645   .643   .650   .639   .657   .647
Unweighted Tags (FTA)        .635   .638   .637   .639   .639   .642   .640
Weighted Tags (Top 10)       .728   .730   .736   .742   .739   .740   .740
Weighted Tags (FTA)          .745   .747   .754   .757   .757   .757   .756

Second level                 3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .435   .439   .434   .447   .445   .443   .447
Tag Fractions (Top 10)       .445   .450   .450   .452   .452   .453   .458
Tag Fractions (FTA)          .432   .440   .439   .440   .440   .441   .445
Unweighted Tags (Top 10)     .430   .440   .441   .443   .435   .440   .449
Unweighted Tags (FTA)        .450   .460   .447   .454   .453   .458   .452
Weighted Tags (Top 10)       .487   .500   .503   .505   .507   .508   .510
Weighted Tags (FTA)          .509   .520   .528   .528   .530   .529   .530

GoodReads - LCC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .625   .636   .629   .630   .632   .630   .631
Tag Fractions (Top 10)       .657   .664   .665   .667   .667   .663   .674
Tag Fractions (FTA)          .650   .658   .656   .658   .659   .654   .663
Unweighted Tags (Top 10)     .625   .626   .633   .633   .634   .623   .629
Unweighted Tags (FTA)        .642   .648   .653   .651   .647   .639   .653
Weighted Tags (Top 10)       .700   .711   .711   .714   .713   .713   .721
Weighted Tags (FTA)          .725   .731   .737   .738   .734   .731   .743

Second level                 3000   6000   9000  12000  15000  18000  21000
Tag Ranks                    .404   .411   .410   .403   .404   .405   .407
Tag Fractions (Top 10)       .412   .421   .426   .427   .430   .427   .427
Tag Fractions (FTA)          .418   .427   .431   .432   .433   .432   .433
Unweighted Tags (Top 10)     .414   .419   .420   .415   .414   .422   .435
Unweighted Tags (FTA)        .462   .475   .467   .478   .477   .481   .484
Weighted Tags (Top 10)       .467   .479   .487   .486   .491   .491   .493
Weighted Tags (FTA)          .494   .507   .510   .514   .513   .517   .519

Table 5.7: Accuracy results for tag-based book classification (GoodReads).

5.4 Comparing Social Tags to Other Data Sources 

Having identified the best representation approach for classification with
social tags, we set out to compare its performance to that of other data
sources. As introduced in Chapter 4 on page 59, we gathered additional data for
the resources we are working on, i.e., web pages and books. In both cases, we
tried to gather two more types of data: content and reviews. For web pages, we
rely on the textual content contained in the HTML source and on user reviews
fetched from social networks. For books, we consider synopses and editorial
reviews as a summary of their content on the one hand, and user-generated
reviews on the other.

With that content and those user reviews, we created a representation based on
the bag-of-words model (Harris, 1970). We merged all the texts available for a
resource from a given source into a single text. To clean up those texts, we
stripped HTML tags, removed stop-words and stemmed the remaining words
(Porter, 1980). Then, we weighted the words according to the TF-IDF scheme.
The final representation of a resource, based on either content or user
reviews, is a vector composed of words weighted by their TF-IDF values.
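The text pipeline just described can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used in the experiments: the stop-word list is a tiny placeholder, stemming is omitted because the Porter stemmer is not part of the standard library, and the TF-IDF variant (raw term frequency times log(N/df)) is an assumption, since the exact formula is not given here. All names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Strip HTML tags, lowercase, split into words and drop stop-words."""
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z]+", text.lower())
    # NOTE: stemming (Porter, 1980) would be applied here; omitted in this sketch.
    return [w for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, weighted by TF-IDF."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # ASSUMPTION: weight = raw term frequency * log(N / df).
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


docs = [
    "<p>The history of Rome</p>",
    "Rome travel guide",
    "A guide to cooking",
]
for vector in tfidf_vectors(docs):
    print(vector)
```

Words that occur in fewer documents (here "history") receive a higher weight than words shared by several documents (here "rome").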

We use the same method as above to create training sets of different sizes,
with 6 runs each. As both LibraryThing and GoodReads work on the same books,
the content and user reviews are the same in both cases, so we group their
results into a single table. For the three datasets we work with, we show the
results of using content and user reviews, and compare them to the best
tag-based approach, that is, the FTA-based weighted approach.



Delicious - ODP

Top level                     600   1400   2200   3000   4000   5000   6000
Content                      .518   .561   .579   .588   .595   .604   .610
Reviews                      .520   .578   .602   .618   .630   .639   .646
Tags                         .533   .600   .629   .647   .660   .669   .680

Second level                  600   1400   2200   3000   4000   5000   6000
Content                      .337   .394   .422   .437   .450   .464   .470
Reviews                      .349   .423   .459   .478   .497   .511   .524
Tags                         .359   .453   .498   .522   .541   .556   .568

Table 5.8: Accuracy results comparing different data sources on web page
classification.

Table 5.8 shows the results for the Delicious dataset. In this case, the
self-content of web pages is the worst of the three data sources we studied:
its results are well below those of reviews and tags. Social tags, in turn, are
clearly the best data source for the classification task. Tags clearly
outperform the other sources on the top level, and the difference is even
larger on the second level. This reinforces one of the main motivations of this
thesis, i.e., the fact that self-content is not always representative of a
resource's aboutness, and other data sources can provide more accurate
descriptions.



LibraryThing & GoodReads - DDC

Top level                    3000   6000   9000  12000  15000  18000  21000
Content                      .767   .792   .802   .809   .809   .815   .817
Reviews                      .777   .808   .820   .831   .833   .839   .840
Tags (LibraryThing)          .861   .864   .864   .867   .869   .869   .868
Tags (GoodReads)             .745   .747   .754   .757   .757   .757   .756

Second level                 3000   6000   9000  12000  15000  18000  21000
Content                      .572   .612   .631   .643   .649   .657   .660
Reviews                      .582   .628   .651   .667   .678   .685   .693
Tags (LibraryThing)          .690   .700   .707   .709   .715   .712   .715
Tags (GoodReads)             .509   .520   .528   .528   .530   .529   .530

LibraryThing & GoodReads - LCC

Top level                    3000   6000   9000  12000  15000  18000  21000
Content                      .767   .789   .798   .803   .806   .807   .810
Reviews                      .780   .803   .816   .823   .827   .828   .833
Tags (LibraryThing)          .853   .857   .856   .861   .861   .857   .861
Tags (GoodReads)             .725   .731   .737   .738   .734   .731   .743

Second level                 3000   6000   9000  12000  15000  18000  21000
Content                      .579   .620   .645   .658   .668   .673   .681
Reviews                      .581   .637   .664   .683   .698   .705   .712
Tags (LibraryThing)          .703   .725   .729   .734   .734   .736   .739
Tags (GoodReads)             .494   .507   .510   .514   .513   .517   .519

Table 5.9: Accuracy results comparing different data sources on book
classification.

Table 5.9 shows the results for books, using tags from LibraryThing and
GoodReads. With tags from LibraryThing, the results are similar to those for
Delicious. Again, user reviews outperform the content (although in this case we
considered synopses and editorial reviews as a summary of the content of the
book). Moreover, social tags perform even better than user reviews, especially
for the second level classification. The results are comparable for both
classification schemes, DDC and LCC.

However, tags from GoodReads are not enough to achieve results as good as those
of content or user reviews. GoodReads tags clearly underperform the other data
sources, and reviews are the best-scoring data source for this dataset. We
believe that this happens because most GoodReads users do not provide tags when
bookmarking a book². A community providing fewer annotations thus gives rise to
a less accurate aggregation of tags.

In summary, tags prove to be really powerful compared to other data sources
such as the content of the resource or user reviews of it. However, large
amounts of annotations are necessary for tags to outperform them.

5.5 Getting the Most Out of All Data Sources 

Even though the tag-based representation outperforms the other two data
sources, namely content and user reviews, in most cases, all of them yield
encouraging results and look good enough to be combined in an attempt to
further improve the classifier's performance. This raises the following
questions: what if one classifier is guessing correctly while the others are
making a mistake? Could we combine the predictions to get the most out of each
of them?

An interesting approach to combine SVM classifiers is known as classifier 
committees (Sun et al., 2004). Classifier committees rely on the predictions of 
several classifiers, and combine them by means of a decision function, which 
serves to define the weight or relevance of each classifier in the final prediction. 
After applying the decision function on the predictions of all classifiers, a single 
unified prediction can be inferred. 
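As a minimal sketch of such a committee, the Python code below assumes the decision function adopted in this section: each classifier's margins are divided by the largest margin that classifier outputs, the normalized margins are added up per class, and the class with the largest sum is predicted. The function names and data layout are hypothetical, and the sketch assumes that every classifier outputs at least one positive margin. The usage lines reuse the margins of the toy example in Table 5.10.

```python
def normalize(margins):
    """Divide a classifier's margins by its largest margin.

    margins is a list with one row per resource; each row holds one margin
    per class. ASSUMPTION: the maximum is taken over all resources and
    classes of this classifier, and it is positive.
    """
    peak = max(max(row) for row in margins)
    return [[m / peak for m in row] for row in margins]


def committee_predict(classifier_margins):
    """Return the predicted class index for each resource.

    classifier_margins holds one margin matrix (resources x classes) per
    classifier in the committee.
    """
    normalized = [normalize(margins) for margins in classifier_margins]
    n_resources = len(normalized[0])
    n_classes = len(normalized[0][0])
    predictions = []
    for j in range(n_resources):
        # Sum of normalized margins for resource j and each class c.
        sums = [sum(clf[j][c] for clf in normalized) for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=lambda c: sums[c]))
    return predictions


# Margins of the toy example in Table 5.10 (one resource, three categories).
classifier_a = [[1.2, 1.1, 0.6]]
classifier_b = [[0.5, 1.0, 1.2]]
print(committee_predict([classifier_a]))                # index 0 = category #1
print(committee_predict([classifier_b]))                # index 2 = category #3
print(committee_predict([classifier_a, classifier_b]))  # index 1 = category #2
```

Each classifier on its own picks a wrong category, while the committee recovers category #2, which both classifiers ranked second.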

An SVM classifier outputs a margin for each resource over each class in the
taxonomy, indicating how reliably the resource belongs to that class. The class
with the largest positive margin for each resource is then selected as the
classifier's prediction. The larger the gap between the largest positive margin
and the rest of the margins, the more reliable the classifier's prediction can
be considered. Thus, the predictions of SVM classifiers can be combined by
adding up their margins or reliability values for each class. Each resource
will then have a new reliability value for each class, i.e., the sum of the
margins given by the different classifiers for that resource. Nonetheless, in
this case, since each of the three classifiers works with a different type of
data, the range and scale of the margins they output differ. To solve this, we
propose normalizing the margins by the maximum margin value output by each
classifier, max(m_i) (see Equation 5.1).

    m'_{ijc} = \frac{m_{ijc}}{\max(m_i)}                    (5.1)

where m_{ijc} is the margin given by the classifier i between the resource j
and the hyperplane for the class c, and m'_{ijc} is its value after
normalization.

The class maximizing this sum of margins will be predicted by the committee.
The sum of margins between the class c and the resource j using a committee
with n classifiers is defined by Equation 5.2.

    S_{jc} = \sum_{i=1}^{n} m'_{ijc}                        (5.2)

If the classifiers are working over k classes, then the predicted class for the
resource j will be defined by Equation 5.3.

    C^*_j = \arg\max_{c = 1..k} S_{jc}                      (5.3)

2 GoodReads does not encourage users to add tags as LibraryThing does,
requiring a second click from the user, which brings about a large set of
unannotated bookmarks, representing a ratio of more than 80% of bookmarks (see
Section 4.4 on page 65).

As a toy example of the possible advantage of using classifier committees,
Table 5.10 shows the outputs, in the form of margins, of two classifiers for a
resource in a taxonomy with 3 categories. Let this resource belong to category
#2. The example shows that, on the one hand, classifier A has predicted
category #1, with a margin of 1.2, but with only a slight gap to category #2,
which gets a margin of 1.1. On the other hand, classifier B says that the
resource should be classified in category #3, for which it returned a margin of
1.2, but the gap is again slight as compared to category #2, with a margin of
1.0. The classifier committee considers all the outputs by adding the margins
up in order to return a new margin value for the resource for each category. As
a result, the committee gets the largest margin value for category #2, with
2.1, as compared to 1.8 for category #3 and 1.7 for category #1. Hence, both
classifiers on their own were wrong in classifying this resource, but their
prediction criteria were good enough to be merged with those of other
classifiers. The fact that the actual category of the resource was predicted in
second place by both classifiers gives rise to the correct classification when
using committees.

                          Category #1   Category #2   Category #3
Classifier A                  1.2           1.1           0.6
Classifier B                  0.5           1.0           1.2
Classifier committees         1.7           2.1           1.8

Table 5.10: Example of classifier committees, where both classifiers mispredict
the category of the resource. One of them predicts category #1, whereas the
other predicts category #3. However, it should actually be classified in
category #2, which is correctly predicted when adding margins up by using
classifier committees.

Next, we show the results of using classifier committees in separate tables for
each dataset. Note that the tag-based approach is also included, to enable
comparing the performance of the committees to it.

Table 5.11 shows the results of using classifier committees on Delicious. The
effect of using committees on this dataset is really positive, because all of
the combinations that include tags outperform the tag-based classifier. The
committees considering only reviews and content may perform worse than tags on
their own. The committees that include the tag-based classifier are the three
best. Even though tags combine positively with content and reviews separately,
combining all three data sources provides a slight improvement over the other
two.

Delicious - ODP

Top level                     600   1400   2200   3000   4000   5000   6000
Tags                         .533   .600   .629   .647   .660   .669   .680
Content + Reviews            .554   .604   .627   .642   .651   .660   .670
Content + Tags               .580   .633   .655   .671   .678   .687   .696
Reviews + Tags               .561   .618   .644   .662   .675   .685   .694
Content + Reviews + Tags     .581   .632   .655   .671   .681   .691   .699

Second level                  600   1400   2200   3000   4000   5000   6000
Tags                         .359   .453   .498   .522   .541   .556   .568
Content + Reviews            .382   .450   .486   .505   .522   .538   .547
Content + Tags               .409   .488   .528   .547   .564   .578   .587
Reviews + Tags               .389   .474   .512   .534   .555   .571   .584
Content + Reviews + Tags     .412   .488   .524   .545   .564   .579   .588

Table 5.11: Accuracy results of classifier committees for web page
classification.

Reviews perform better than content on their own, but the latter performs 
better when combined with tags. This shows that even though content performs 
worse, it provides more reliable predictions than reviews, performing better on 
committees. Nonetheless, relying on all three data sources performs the best in 
most cases for both levels of the taxonomy. 

Table 5.12 shows the results of using classifier committees on LibraryThing.
These results show the great potential of the tags provided by users on this
social tagging system. Committees combining data sources do not always
outperform the sole use of tags. However, combining tags with user reviews
gives rise to higher performance, especially for the second level
classification.

On the other hand, using content in committees yields inferior results. This
shows that, besides performing worse on its own, content is not good enough in
this case to feed classifier committees. Probably, using synopses and editorial
reviews as a summary of the content, because of the unavailability of the
actual content of the book, makes it insufficient to get solid results.

LibraryThing - DDC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tags                         .861   .864   .864   .867   .869   .869   .868
Content + Reviews            .778   .803   .814   .821   .823   .827   .830
Content + Tags               .823   .842   .845   .849   .851   .852   .852
Reviews + Tags               .857   .866   .868   .872   .875   .876   .876
Content + Reviews + Tags     .824   .843   .847   .852   .855   .856   .856

Second level                 3000   6000   9000  12000  15000  18000  21000
Tags                         .690   .700   .707   .709   .715   .712   .715
Content + Reviews            .589   .631   .652   .663   .670   .679   .684
Content + Tags               .645   .672   .688   .695   .700   .706   .707
Reviews + Tags               .687   .708   .717   .721   .729   .729   .733
Content + Reviews + Tags     .647   .677   .693   .701   .705   .713   .713

LibraryThing - LCC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tags                         .853   .857   .856   .861   .861   .857   .861
Content + Reviews            .777   .800   .808   .814   .818   .817   .824
Content + Tags               .787   .806   .815   .819   .824   .821   .830
Reviews + Tags               .831   .845   .853   .856   .861   .859   .864
Content + Reviews + Tags     .791   .811   .820   .826   .831   .827   .838

Second level                 3000   6000   9000  12000  15000  18000  21000
Tags                         .703   .725   .729   .734   .734   .736   .739
Content + Reviews            .600   .648   .674   .690   .705   .704   .719
Content + Tags               .640   .677   .698   .709   .723   .720   .738
Reviews + Tags               .688   .723   .736   .746   .754   .755   .766
Content + Reviews + Tags     .645   .685   .708   .721   .733   .732   .750

Table 5.12: Accuracy results of classifier committees for book classification
(LibraryThing).

Table 5.13 on the next page shows the results of using classifier committees on
GoodReads. In this case, tags on their own were not strong enough to reach the
results of content or user reviews. However, the committees that include tags
perform the best, showing their high reliability when it comes to combining
predictions.

As with LibraryThing, content does not seem to be a reliable source for
committees. Combining it with reviews and tags yields results similar to or
even worse than those obtained when excluding it. Combining reviews and tags is
again the best option for the top level of the taxonomies, as for LibraryThing.
Surprisingly, this combination produces results almost as good as those
obtained with LibraryThing tags, which perform far better on their own. This
shows that even though tags from GoodReads are not accurate enough on their
own, they provide reliable margins to be considered in committees.

When comparing the two taxonomies, DDC and LCC, neither GoodReads nor
LibraryThing shows notable differences between them, suggesting that the
conclusions hold regardless of the classification scheme.

In summary, tags have shown great potential, not only as a source to classify
on their own, but also as a provider of reliable prediction criteria to be
combined with other data sources. Moreover, in some cases, as on GoodReads,
tags were not good enough on their own, but have proven to be a solid data
source when used in classifier committees. Nonetheless, the data source
combined with tags must be solid enough and provide reliable predictions to get
better results. When data sources are selected appropriately, the performance
improvement can be considerable. In this regard, we have seen that the synopses
and editorial reviews we chose as a summary of the content of books provide
inappropriate predictions.

GoodReads - DDC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tags                         .745   .747   .754   .757   .757   .757   .756
Content + Reviews            .778   .803   .814   .821   .823   .827   .830
Content + Tags               .797   .822   .831   .837   .838   .844   .845
Reviews + Tags               .820   .847   .857   .865   .867   .872   .874
Content + Reviews + Tags     .806   .831   .842   .849   .851   .854   .857

Second level                 3000   6000   9000  12000  15000  18000  21000
Tags                         .509   .520   .528   .528   .530   .529   .530
Content + Reviews            .589   .631   .652   .663   .670   .679   .684
Content + Tags               .594   .633   .652   .662   .671   .676   .680
Reviews + Tags               .610   .651   .670   .683   .691   .696   .705
Content + Reviews + Tags     .611   .651   .672   .683   .689   .698   .702

GoodReads - LCC

Top level                    3000   6000   9000  12000  15000  18000  21000
Tags                         .725   .731   .737   .738   .734   .731   .743
Content + Reviews            .777   .800   .808   .814   .818   .817   .824
Content + Tags               .793   .814   .823   .829   .831   .832   .836
Reviews + Tags               .831   .836   .847   .853   .857   .857   .864
Content + Reviews + Tags     .801   .825   .833   .839   .844   .843   .850

Second level                 3000   6000   9000  12000  15000  18000  21000
Tags                         .494   .507   .510   .514   .513   .517   .519
Content + Reviews            .600   .648   .674   .690   .705   .704   .719
Content + Tags               .608   .649   .672   .684   .692   .696   .703
Reviews + Tags               .624   .674   .696   .712   .725   .730   .735
Content + Reviews + Tags     .626   .674   .699   .713   .728   .727   .742

Table 5.13: Accuracy results of classifier committees for book classification
(GoodReads).

5.6 Conclusion

In this chapter, we have carried out extensive experiments and a thorough
analysis of the use of social tags as a source to feed resource classifiers. We
have compared the performance of social tags to that of other data sources,
such as the content or user reviews gathered from social media. The experiments
have been applied to the three large-scale social tagging datasets introduced
in Chapter 4 on page 59, gathered from tagging sites with different settings
and annotated resources, which allows us to draw more general conclusions.
Classification experiments have been carried out with annotated web pages over
the ODP taxonomy, and annotated books over the DDC and LCC taxonomies. The
great potential shown by social tags, for both the top and second levels of the
taxonomies, can be strengthened by combining their predictions with those of
other data sources. However, not all data sources are strong enough to perform
well when combining predictions, so the data sources should be selected
appropriately.

Parts of the research in this chapter have been published in Zubiaga et al. 
(2009d), Zubiaga et al. (2009c) and Zubiaga et al. (2011a). 

By means of these experiments, we provided an answer to the following re- 
search questions: 

Research Question 4 

What is the best way of amalgamating users' aggregated annotations on a resource 
in order to get a single representation for a resource classification task? 

We have shown that it is worthwhile to consider all the tags annotated on
a resource rather than only the top tags, i.e., those annotated most often.
Tags in the top are the most important, and give the main information on the
aboutness of resources. However, tags in the tail are also helpful, albeit to
a lesser extent, providing meaningful information and improving the
performance of the classifier.

Regarding the weights assigned to those tags when representing a resource, 
the number of users annotating each tag should be considered in order to get the 
best results. This is the value that has shown the best results in our experiments. 
It has outperformed other approaches ignoring weights or considering other data 
such as the total number of users annotating the resource. 

Thereby, the best representation in our experiments is the one that includes all 
the tags with the values corresponding to the number of users annotating them. 
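
As an illustration of this representation, the following minimal sketch builds
a tag vector for one resource, weighting each tag by the number of users who
annotated it. It is not the code used in our experiments: the function name
`tag_vector`, the `top_k` parameter and the toy annotations are assumptions
made only for this example. Setting `top_k` restricts the vector to the most
annotated tags, whereas leaving it unset keeps all the tags, including the tail.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set


def tag_vector(annotations: Iterable[Set[str]],
               top_k: Optional[int] = None) -> Dict[str, int]:
    """Weight each tag by the number of users who annotated it.

    `annotations` holds one set of tags per user who bookmarked the
    resource (hypothetical input format). With top_k=None all tags are
    kept; otherwise only the top_k most annotated tags remain.
    """
    # Each user contributes at most once per tag because tags are sets.
    counts = Counter(tag for tags in annotations for tag in tags)
    return dict(counts.most_common(top_k))


# Toy data: three users bookmarked the same resource.
annotations = [
    {"python", "programming"},
    {"python", "tutorial"},
    {"python", "programming", "reference"},
]
all_tags = tag_vector(annotations)           # top and tail tags
top_tags = tag_vector(annotations, top_k=2)  # top tags only
```

Here `all_tags` keeps the tail tags "tutorial" and "reference" with a weight of
one user each, while `top_tags` discards them.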

Research Question 5 

Despite the usefulness of social tags for these tasks, is it worthwhile considering
their combination with other data sources like the content of the resource as an
approach to improve the results even more?

By means of classifier committees, which combine the predictions by differ- 
ent classifiers, we have shown that tags provide reliable prediction criteria to take 
into consideration. SVM classifiers not only predict a category, but also assign a 
weight to each category based on the given resource. These weights, given in the 
form of margin values, can be used by other classifiers which rely on different 
data sources. Adding up weights provided by different classifiers can help predict 
the correct category when a single classifier fails to categorize the resource ap- 
propriately. Weights provided by classifiers relying on social tags are especially 
useful when combining them with results from other classifiers. Nonetheless, 
not all data sources are helpful for combination in classifier committees, and the 
selected data source must be solid enough and provide reliable predictions to 
outperform the sole use of tags. When data sources are selected appropriately,
the performance improvement can be considerable. We have shown that the best
choice of data sources varies among datasets. For example, with the Delicious
dataset, it is important to consider all three data sources (content, reviews,
and tags), whereas with the LibraryThing and GoodReads datasets, reviews and
tags suffice.
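
The idea of adding up margins can be sketched as follows. This is only a
minimal illustration under stated assumptions: the function name
`committee_predict`, the category names and the margin values are hypothetical,
and the margins are assumed to be directly comparable across classifiers.

```python
from typing import Dict, List


def committee_predict(margins_per_classifier: List[Dict[str, float]]) -> str:
    """Sum the per-category margins of several classifiers and return
    the category with the largest total."""
    totals: Dict[str, float] = {}
    for margins in margins_per_classifier:
        for category, margin in margins.items():
            totals[category] = totals.get(category, 0.0) + margin
    return max(totals, key=totals.get)


# Hypothetical margins output for one resource by three classifiers,
# each trained on a different data source.
tag_margins = {"Arts": 0.9, "Science": 0.2, "Sports": -0.5}
content_margins = {"Arts": -0.1, "Science": 0.3, "Sports": -0.4}
review_margins = {"Arts": 0.4, "Science": 0.1, "Sports": -0.6}

content_only = committee_predict([content_margins])
committee = committee_predict([tag_margins, content_margins, review_margins])
```

In this toy case the content-based classifier alone would choose "Science",
but the summed margins favor "Arts", which shows how one classifier's mistake
can be corrected by the others.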

Research Question 6 

Are social tags also useful and specific enough to classify resources into narrower 
categories as in deeper levels of hierarchical taxonomies? 

We have analyzed the usefulness of social tags for classification on two
different levels of hierarchical taxonomies. Besides broader categories in the
top level, we have also explored the classification on narrower categories in
the second level. In this regard, social tags have been shown to outperform the
other data sources on social tagging sites that encourage users to annotate
resources (Delicious and LibraryThing). Tags clearly outperform the other
sources in these cases, especially on Delicious, where the difference is even
more favorable in the second level. This difference is very similar on
LibraryThing. Finally, tags from GoodReads do not outperform other data sources
at any level because the system does not encourage users to tag books, so many
bookmarks are not annotated.

Our findings differ from the conclusion of Noll and Meinel (2008a), who
hypothesized that social tags were probably useless for deeper levels of
taxonomies, and that alternative data should be used instead. However, the
authors performed only a statistical analysis, and did not confirm the
hypothesis with real experiments.
Analyzing the Distribution of Tags for Resource Classification



Statistics will prove anything, even the truth. 
— Noel Moynihan 



In this chapter, we deal with the task of considering the repre- 
sentativity of tags for resource classification within a collection 
of social annotations on a social tagging system. To the best 
of our knowledge, no effort has been invested so far on estab- 
lishing the representativity of tags when it comes to finding the 
aboutness of resources. In this regard, we explore how the dis- 
tribution of tags across the three dimensions involved in a social 
tagging system (namely users, resources and bookmarks) can 
determine their representativity. To this end, we study and ana- 
lyze the effectiveness of applying an IDF-like distribution-driven 
weighting scheme in search of performance improvements in a 
resource classification task. We define three analogous weighting
schemes (IUF, IRF and IBF), which rely on distributions of
tags across users, resources and bookmarks, respectively. They
have barely been used for social tagging, and their usefulness
has not yet been proven.
The chapter is organized as follows. In Section 6.1 we motivate
the problem of considering tag distributions as a means to
determine the representativity of tags. In Section 6.2 we
describe the TF-IDF weighting scheme and its use on classical
document collections, and in Section 6.3 we introduce analogous
schemes adapted to social tagging systems. In Section 6.4 we
present a set of experiments (tag-based classification, classifier
committees, and correlation between weighting measures), and
analyze the results. Finally, in Section 6.5 we conclude the
chapter.

We address the following research questions in this chapter: 

Research Question 7 

Can we further consider the distribution of tags across the col- 
lection so that we can measure the overall representativity of 
each tag to represent resources? 

Research Question 8 

What is the best approach to weigh the representativity of tags 
in the collection for resource classification? 

6.1 Tag Distributions 

So far, we have explored ways of amalgamating large numbers of user annotations
provided in the form of social tags, in order to find a suitable representation
of a resource. We considered the weighting of a tag with respect to the resource
where it was annotated, but we did not explore the representativity of tags
within the whole collection. We have assumed that two tags annotated by the
same number of users on a resource are equally representative of that resource
and, therefore, they were assigned the same weight. However, they need not be
equally representative.

From a statistical point of view, we believe that the distribution of tags across
the whole collection has much to do with the overall representativity of tags. By
representativity, we refer to the weight that determines how important a certain
tag is when it comes to representing a resource for its classification.
Accordingly, we believe that a tag that is concentrated in a few resources or
has been used by a few users is more representative than a tag present in most
resources or used by most users. Even if two tags have the same overall use
within the collection, the way they are distributed across users, resources and
bookmarks may determine whether they are focused and precise, or spread and
imprecise.

To this end, a collection-aware weighting scheme like the well-known TF-IDF
seems to be a good alternative. We believe it is suitable to determine the
representativity of tags considering their distribution across the collection. We
found that it had hardly been applied to a social structure like that of tagging
systems. Its adaptation from a classical text collection, where the only dimensions
are terms and documents, to a collection of bookmarks, where tags spread across
users, resources and bookmarks, remains unstudied. Moreover, its usefulness for
tag-based resource classification has not yet been explored.



6.2 TF-IDF as a Term Weighting Function 

TF-IDF is a term weighting function that serves as a statistical measure defining
the importance of a word to a document in a collection (Salton et al., 1975; Salton
and Buckley, 1988). The TF-IDF value for the term i within the document j, as a
part of a document collection D, comprises two underlying measures: (1) the
term frequency (TF), i.e., the number of appearances of the term i within the
document j, and (2) the inverse document frequency (IDF), i.e., the logarithm of
the total number of documents in D divided by the number of documents in which
the term i occurs, which refers to the general importance of the term i in
the collection (see Equation 6.1). The product of these two measures defines the
TF-IDF weight of term i in the document j (see Equation 6.2).

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \qquad (6.1)$$

$$\text{tf-idf}_{ij} = \mathrm{tf}_{ij} \times \mathrm{idf}_i \qquad (6.2)$$

Integrating the IDF factor makes it possible to rate a term lower or higher
depending on its distribution across the collection. This weighting function
yields a higher value when the term i occurs in a few documents, considering
that it is highly representative of those documents. On the other hand, the
value will be lower when the term i occurs in many documents of the collection,
considering that it spreads across the collection instead of being concentrated
in a few documents. In the latter case, the value becomes zero when the term i
occurs in all the documents.
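
The behavior described above can be checked with a small sketch of Equations
6.1 and 6.2. It is a minimal illustration rather than a reference
implementation: the function name `tf_idf` and the toy documents are
assumptions, and the natural logarithm is used because the base is not
relevant to the comparison.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: tf * idf} dictionary per document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # number of appearances of each term in the document
        weights.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return weights


# Toy collection of three tokenized documents.
docs = [
    ["tag", "social", "tag"],
    ["tag", "library"],
    ["tag", "web", "social"],
]
weights = tf_idf(docs)
```

The term "tag" occurs in all three documents, so its weight is zero everywhere,
whereas "library", which occurs in a single document, receives the highest IDF
factor in this collection.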

This weighting scheme has been widely used for Information Retrieval, Text 
Mining and Text Classification, and it is commonly used for term selection tasks. 
There is controversy on its appropriateness for text classification (Lan et al., 2005; 
Forman, 2008), since it does not consider the relations between the terms and 
their appearance in the categories. However, it has shown high effectiveness in 
several text classification tasks (Joachims, 1998; Yang and Liu, 1999; Brank et al.,
2002; Dumais et al., 1998).

There are some works that study the adaptation of TF-IDF to text and web 
page classification tasks. They consider the distribution of terms across categories 
in the training set as a value to determine the representativity of a term. For in- 
stance, in Forman (2008) and Lan et al. (2005) the authors compared some feature 
scoring metrics, including TF-IDF, in a text classification problem using a linear 
SVM. Each of them proposed a new term weighting function that outperformed 
TF-IDF in their experiments. Other works, such as Debole and Sebastiani (2003) 
and Soucy (2005), propose the use of supervised weighting techniques instead of 
unsupervised ones for text classification tasks. 

Even though alternatives to TF-IDF like those mentioned above have been 
proposed and successfully applied to specific tasks and collections, they have 
barely been used subsequently. TF-IDF continues to be the most widely used 
term weighting scheme, and has become a "de facto" standard for document 
representation. 

In this chapter, we rely on TF-IDF as the base weighting to propose analo- 
gous schemes adapted to social tagging systems. Even though we could rely on 
alternatives, our main goals are (1) to perform a study on its adaptability to these 
structures, and (2) to find out how the settings of social tagging systems affect 
the resulting tag distributions and thereby the values of such weights. Thus, we 
will not include any category data in the calculation of the weights. 

6.3 Tag Weighting Functions Based on Inverse Frequencies

Unlike classical collections of web documents or library catalogs, where the
distribution of terms across documents in the collection has been studied, social
tagging systems comprise more dimensions to explore. Besides the distribution
of tags across documents or annotated resources, different users set those
tags within different bookmarks. These two characteristics are new in social
tagging with respect to classical text document collections. Despite this clear
difference in the nature of social tagging systems, not enough attention has been
paid to analyzing how each of the dimensions (resources, users and bookmarks)
affects tag distributions and, therefore, to establishing tag relevance.

TF-IDF has been widely applied to text collections, and has proven to be
beneficial for a large number of tasks. Text collections are mainly made up of
terms written by the authors, though, and the appropriateness of using a similar
approach for a collection made up of tags annotated by users other than the
authors in a social environment is not clear.
Next, we introduce three tag weighting approaches, taking the classical TF-IDF
approach to the social tagging scenario, and adapting it to rely on resources,
users and bookmarks. These three dimensions suggest the definition of three
tag weighting functions considering inverse resource frequency (IRF), inverse
user frequency (IUF), and inverse bookmark frequency (IBF) values, respectively.
These three approaches follow the same function for the tag i within the resource
j (see Equation 6.3).



$$\text{TF-IxF}_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{ixf}_i \qquad (6.3)$$

where $\mathrm{tf}_{ij}$ is the number of occurrences of the tag i in the resource j, and
$\mathrm{ixf}_i$ is the inverse frequency function considered in each case, irf, iuf or
ibf, thus x being r, u, or b.



6.3.1 TF-IRF 

This is the application of the TF-IDF approach to a social tagging system with 
annotated resources, considering that resources are analogous to documents in 
this case. Tags that are widely spread across resources are penalized with low 
weights and, vice versa, tags within fewer resources are considered relevant with 
a higher weight. Thus, the function outputs the logarithm of the total number 
of resources divided by the number of resources in which the tag is present (see 
Equation 6.4). 



$$\mathrm{irf}_i = \log \frac{|R|}{|\{r : t_i \in r\}|} \qquad (6.4)$$


It has previously been used in a few works in the social tagging literature, 
even though they usually referred to this approach as TF-IDF. Angelova et al. 
(2008) rely on this measure to infer similarity of tags by creating a tag graph, 
weighting the TF-IDF value of each user to a tag. Shepitsen et al. (2008) and Liang 
et al. (2010) use this measure to represent the resources in a recommendation 
system where resources are recommended to users. The latter concluded that 
although both TF-IDF and TF have identical trends, the former provides superior 
results in their recommendation task. Likewise, Ramage et al. (2009) compared
TF-IDF and TF for clustering web pages, and showed the superiority of the former.
However, they did not pay attention to the effect of tag distributions on these
weightings, and they showed the usefulness of TF-IDF just for a specific case. Li
et al. (2008) create tag vectors using TF-IDF to compute the similarity between
two documents annotated on Delicious. They assumed this weighting measure,
and did not examine whether or not it was appropriate.
6.3.2 TF-IUF

As a new dimension present in social tagging systems, the number of users using 
each of the tags could also be significant to know whether a tag is representative 
to a collection of resources. Thus, we consider that a tag used by many users is 
not as representative as a tag that fewer users are utilizing (see Equation 6.5). 

$$\mathrm{iuf}_i = \log \frac{|U|}{|\{u : t_i \in u\}|} \qquad (6.5)$$

This function was derived from a previous application to a collaborative
filtering system (Breese et al., 1998). With the aim of recommending resources to
users, Diederich and Iofciu (2006) and Liang et al. (2010) rely on the IUF for dis- 
covering similarities among users. The latter use both IUF and IRF to represent 
users and resources, respectively, but no comparison is performed among their 
characteristics. In Abbasi et al. (2009), TF-IUF is used along with TF-IRF over 
Flickr tags and user groups for finding landmark photos. They concluded that 
their approach was effective to find landmark photos on Flickr, but they did not 
study whether or not relying on those weighting measures was appropriate. 



6.3.3 TF-IBF 

This is a similar inverse weighting function relying on the third dimension in 
which tags are distributed: bookmarks. This function considers that a tag that 
has been used in many bookmarks is not as relevant to represent a resource as 
others that have been assigned to fewer bookmarks (see Equation 6.6). 

$$\mathrm{ibf}_i = \log \frac{|B|}{|\{b : t_i \in b\}|} \qquad (6.6)$$

To the best of our knowledge, this tag weighting scheme has never been used 
so far. Even though all three frequencies can somehow be related, there are 
substantial differences among them. A tag used by many users can spread across 
many resources, or it can just congregate in a few resources. Likewise, this factor 
might affect the number of bookmarks. 
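
To make the difference among the three frequencies concrete, the following
sketch computes IRF, IUF and IBF for every tag in a small set of bookmarks. It
is a minimal illustration under stated assumptions: the `Bookmark` type, the
function name `inverse_frequencies` and the toy data are hypothetical, each
bookmark is assumed to be one (user, resource) pair with a set of tags, and the
natural logarithm is used.

```python
import math
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple


class Bookmark(NamedTuple):
    user: str
    resource: str
    tags: FrozenSet[str]


def inverse_frequencies(bookmarks: List[Bookmark]) -> Dict[str, Dict[str, float]]:
    """Compute irf, iuf and ibf for every tag in the collection."""
    users, resources = set(), set()
    tag_users, tag_resources = defaultdict(set), defaultdict(set)
    tag_bookmarks = defaultdict(int)
    for b in bookmarks:
        users.add(b.user)
        resources.add(b.resource)
        for tag in b.tags:
            tag_users[tag].add(b.user)
            tag_resources[tag].add(b.resource)
            tag_bookmarks[tag] += 1  # a tag counts once per bookmark
    return {
        tag: {
            "irf": math.log(len(resources) / len(tag_resources[tag])),
            "iuf": math.log(len(users) / len(tag_users[tag])),
            "ibf": math.log(len(bookmarks) / tag_bookmarks[tag]),
        }
        for tag in tag_bookmarks
    }


# "python" is used by every user on a single resource, whereas "todo" is
# used by a single user on three resources; both appear in three bookmarks.
bookmarks = [
    Bookmark("u1", "r1", frozenset({"python"})),
    Bookmark("u2", "r1", frozenset({"python"})),
    Bookmark("u3", "r1", frozenset({"python"})),
    Bookmark("u1", "r2", frozenset({"todo"})),
    Bookmark("u1", "r3", frozenset({"todo"})),
    Bookmark("u1", "r4", frozenset({"todo"})),
]
weights = inverse_frequencies(bookmarks)
```

In this toy collection both tags obtain the same IBF, but "python" gets the
higher IRF and "todo" the higher IUF, which illustrates that the three
frequencies capture different aspects of how a tag is distributed.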



6.4 Experiments 

Next, we present the classification experiments that enable us (1) to analyze how
each of the proposed tag weighting functions contributes to the classification of
annotated resources, and (2) to discover whether any of the inverse tag
weighting approaches outperforms the baseline relying only on the tag frequency
(TF). In order to further analyze their usefulness and suitability, we also
experimented on their performance when applied to classifier committees. Finally,
we analyze the correlation between the different tag weighting functions.
6.4.1 Tag-based Classification

The first experiment focuses on evaluating the usefulness of tag weighting
functions for a resource classification task. We perform this evaluation by
comparing tag-based representations using each of the three weighting functions
(TF-IRF, TF-IUF and TF-IBF) and the absence of distributional weighting
functions (TF). Note that the latter is the same as the FTA-based weighted
approach we found to be the best representation in Chapter 5, and it is thus
the best-performing approach so far. This experiment uses an SVM with the same
settings as those defined in the previous chapter (see Section 5.3). We
show the results for all three datasets and four different representations,
including the three weighting measures and TF.



Delicious - ODP

Top level
            600    1400   2200   3000   4000   5000   6000
TF          .533   .600   .629   .647   .660   .669   .680
TF-IRF      .516   .571   .593   .607   .619   .631   .639
TF-IBF      .519   .573   .596   .611   .622   .633   .641
TF-IUF      .528   .580   .607   .625   .636   .653   .661

Second level
            600    1400   2200   3000   4000   5000   6000
TF          .359   .453   .498   .522   .541   .556   .568
TF-IRF      .344   .424   .463   .486   .506   .518   .529
TF-IBF      .348   .429   .467   .489   .509   .520   .532
TF-IUF      .358   .437   .478   .502   .523   .541   .555



Table 6.1: Accuracy results of tag-based web page classification using weighting 
schemes. 

Table 6.1 shows the results of using tag weighting functions on Delicious.
It can be seen that the use of inverse weighting functions is not useful in this
case. On the contrary, their use harms the performance of the classifier, yielding
results inferior to those obtained by the TF approach, which does not consider
weighting functions. Going further into the analysis of the performance of
representations relying on weighting functions, the results show that IUF gets
the best results among them, followed by IBF, and then IRF. This happens for
both top and second levels of the taxonomy in a similar manner.

Our conjecture about this is that resource-based tag suggestions provided
by Delicious are not helpful to this end. We have already shown in Chapter 4
that such a feature alters the structure of the folksonomy on Delicious.
It makes the top tags become even more popular and it alters the natural
distribution of tags. Thus, such a forced distribution of tags produces weights
that lead to lower performance. Moreover, the fact that IUF is the best weighting
function in this case shows the importance of users who make their own choices
instead of relying on suggestions. That is, users who choose their own tags,
unlike those relying on suggestion-based annotations, give rise to higher
weights for their rarely used tags. When users rely on suggestions, it does
not make any difference to the IRF values of tags, because the resource
frequency remains unchanged. The difference is also small for IBF values.
However, it makes a big difference to IUF values, because those suggestions
increase the user frequencies of tags and thus reduce IUF values. Accordingly,
users who make their own choices yield higher IUF values because it is likely
that their tags are not being used that many times. Probably, IUF would perform
better than TF if there were fewer users who rely on system suggestions, and
hence more users providing their own tags instead.
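
The effect of resource-based suggestions on each inverse frequency can be
illustrated with a toy example. The sketch below compares two hypothetical
scenarios for a new bookmark on a resource that already carries the tag "web":
in one, the new user accepts the suggested tag; in the other, the user chooses
a tag of their own. The data, the function name `inverse_frequencies_for_tag`
and the use of the natural logarithm are assumptions made only for this
illustration.

```python
import math
from typing import List, Set, Tuple

Bookmark = Tuple[str, str, Set[str]]  # (user, resource, tags)


def inverse_frequencies_for_tag(bookmarks: List[Bookmark],
                                tag: str) -> Tuple[float, float, float]:
    """Return (irf, iuf, ibf) of `tag` over a list of bookmarks."""
    users = {u for u, _, _ in bookmarks}
    resources = {r for _, r, _ in bookmarks}
    tagged = [(u, r) for u, r, tags in bookmarks if tag in tags]
    irf = math.log(len(resources) / len({r for _, r in tagged}))
    iuf = math.log(len(users) / len({u for u, _ in tagged}))
    ibf = math.log(len(bookmarks) / len(tagged))
    return irf, iuf, ibf


existing = [
    ("u1", "r1", {"web"}), ("u1", "r2", {"web"}), ("u1", "r3", {"web"}),
    ("u2", "r1", {"web"}), ("u2", "r2", {"news"}), ("u3", "r4", {"news"}),
]
# A new user u4 bookmarks r1, where "web" is already present.
accepts_suggestion = existing + [("u4", "r1", {"web"})]
chooses_own_tag = existing + [("u4", "r1", {"design"})]

irf_s, iuf_s, ibf_s = inverse_frequencies_for_tag(accepts_suggestion, "web")
irf_o, iuf_o, ibf_o = inverse_frequencies_for_tag(chooses_own_tag, "web")
```

In this example the IRF of "web" is identical in both scenarios, its IBF drops
slightly when the suggestion is accepted, and its IUF drops the most, in line
with the explanation above.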



LibraryThing - DDC

Top level
            3000   6000   9000   12000  15000  18000  21000
TF          .861   .864   .864   .867   .869   .869   .868
TF-IRF      .877   .889   .894   .897   .900   .902   .902
TF-IBF      .877   .889   .894   .897   .900   .903   .904
TF-IUF      .881   .891   .895   .897   .899   .901   .900

Second level
            3000   6000   9000   12000  15000  18000  21000
TF          .690   .700   .707   .709   .715   .712   .715
TF-IRF      .723   .750   .762   .768   .774   .777   .780
TF-IBF      .723   .751   .763   .770   .775   .779   .781
TF-IUF      .729   .751   .761   .766   .771   .771   .776

LibraryThing - LCC

Top level
            3000   6000   9000   12000  15000  18000  21000
TF          .853   .857   .856   .861   .861   .857   .861
TF-IRF      .867   .883   .887   .893   .895   .894   .897
TF-IBF      .867   .883   .888   .893   .896   .895   .898
TF-IUF      .871   .882   .885   .892   .893   .892   .894

Second level
            3000   6000   9000   12000  15000  18000  21000
TF          .703   .725   .729   .734   .734   .736   .739
TF-IRF      .751   .780   .793   .803   .804   .809   .814
TF-IBF      .751   .781   .796   .805   .806   .811   .818
TF-IUF      .754   .780   .790   .798   .800   .803   .807



Table 6.2: Accuracy results of tag-based book classification using weighting 
schemes (LibraryThing). 

Table 6.2 shows the results of using tag weighting functions on LibraryThing
over DDC and LCC schemes. In this case, all the inverse weighting functions
are clearly superior to TF, since the former always outperform the latter. Even
though the improvement is much larger for the second level, the superiority of
weighting functions is clear for both levels. This shows that the studied inverse
weighting functions can be really useful for folksonomies created in the absence
of suggestions. Inverse tag weighting functions have successfully set suitable
weights towards a definition of the representativity of tags in this case, in contrast
to Delicious.

Among the tag weighting functions, all of them perform similarly, and no 
clear outperformances can be seen in these results. However, IBF seems to pro- 
vide slightly better results than the other two approaches, followed by IRF. IUF 
is the worst function in this case, suggesting that the number of users choosing 
each tag is not the most relevant feature when there are no suggestions. 



GoodReads - DDC

Top level
            3000   6000   9000   12000  15000  18000  21000
TF          .745   .747   .754   .757   .757   .757   .756
TF-IRF      .800   .808   .813   .817   .816   .817   .816
TF-IBF      .800   .809   .814   .817   .817   .818   .818
TF-IUF      .797   .805   .810   .814   .813   .814   .814

Second level
            3000   6000   9000   12000  15000  18000  21000
TF          .509   .520   .528   .528   .530   .529   .530
TF-IRF      .579   .599   .609   .612   .617   .619   .621
TF-IBF      .583   .602   .614   .618   .624   .626   .628
TF-IUF      .578   .598   .609   .613   .619   .620   .623

GoodReads - LCC

Top level
            3000   6000   9000   12000  15000  18000  21000
TF          .725   .731   .737   .738   .734   .731   .743
TF-IRF      .781   .792   .797   .801   .802   .799   .804
TF-IBF      .781   .792   .797   .803   .802   .800   .805
TF-IUF      .776   .788   .792   .797   .797   .794   .800

Second level
            3000   6000   9000   12000  15000  18000  21000
TF          .494   .507   .510   .514   .513   .517   .519
TF-IRF      .578   .599   .608   .617   .618   .622   .627
TF-IBF      .582   .605   .615   .625   .625   .628   .634
TF-IUF      .576   .600   .610   .619   .620   .623   .628



Table 6.3: Accuracy results of tag-based book classification using weighting 
schemes (GoodReads). 

Table 6.3 shows the results of using inverse tag weighting functions over DDC
and LCC schemes on GoodReads. Similar to LibraryThing, tag weighting functions
clearly outperform the sole use of TF. Moreover, the improvements are
even larger than for LibraryThing. As on LibraryThing, IBF performs the best
among the weighting functions, followed by IRF, and then IUF.

Even though there are also system suggestions on GoodReads, they rely on
tags previously used by the user, i.e., their personomy, and thus these
suggestions can only be applied to different bookmarks and resources. Thereby,
those users who tend to choose new tags instead of reusing tags from their
personomy yield more natural bookmark frequencies. This helps IBF perform
better, but has no impact on IUF, as it is not altered by personomy-based
suggestions. This shows that the effect of personomy-based suggestions is much
smaller, and that it affects the distribution of tags only slightly, if at all,
because suggestions do not spread to other users. Accordingly, the studied tag
weighting functions perform well when this type of suggestion exists. On both
LibraryThing and GoodReads, the results for the different classification schemes,
DDC and LCC, are comparable and show a similar trend.

Summarizing, results show that the studied inverse tag weighting functions
can be really useful for determining the representativity of each tag within the
collection. However, folksonomies can suffer from resource-based tag
suggestions, which transform their structure and distributions. This
transformation can even be harmful for the definition of tag weighting
functions, and can bring about worse performance than simply relying on TF, as
happened on Delicious. Otherwise, in the absence of resource-based tag
suggestions, the use of tag weighting functions contributes positively to the
performance of the classifier.

Comparing the results scored by tag weighting functions, it can be seen that 
IBF is always slightly better than IRF. The former is more detailed than the lat- 
ter, because it considers the exact number of appearances of the tag besides the 
number of resources it appears in. Indeed, IBF is the best approach for both
LibraryThing and GoodReads, where there are no suggestions, or suggestions rely
on the user's personomy. When these suggestions rely on tags previously annotated
by others on the resource, as on Delicious, IUF performs better than the other
two weighting functions, showing the relevance of the ability of users to dismiss
suggestions. However, even IUF is not able to outperform TF in this case.

6.4.2 Revisiting Classifier Committees 

Apart from the results scored using tag weighting functions and the comparison
of their performance to that obtained by relying on TF, it is interesting to
analyze how suitable they are for combination with other data sources. As we
did in Chapter 5, we use classifier committees to evaluate the ability of the
approaches using tag weighting functions to be combined with content and/or
reviews, and to improve the performance of the classifier even further. This
time, we rely on the best committees for each dataset, i.e., the triple
combination of tags, content and reviews for Delicious, and the double
combination of tags and reviews for LibraryThing and GoodReads. We run them
using the four different weightings for tags (TF, TF-IBF, TF-IRF, and TF-IUF),
i.e., those compared in the previous section. By using classifier committees
upon these weightings, we aim at analyzing how well they perform not only on
their own, but also when their predictions are combined with those from other
data sources.



Delicious - ODP

Top level
            600    1400   2200   3000   4000   5000   6000
TF          .581   .632   .655   .671   .681   .691   .699
TF-IRF      .576   .629   .653   .669   .680   .690   .697
TF-IBF      .576   .630   .653   .670   .680   .690   .698
TF-IUF      .576   .631   .654   .672   .682   .692   .700

Second level
            600    1400   2200   3000   4000   5000   6000
TF          .412   .488   .524   .545   .564   .579   .588
TF-IRF      .406   .485   .523   .546   .566   .580   .592
TF-IBF      .407   .486   .525   .548   .566   .580   .592
TF-IUF      .408   .488   .526   .548   .569   .584   .595



Table 6.4: Accuracy results of classifier committees for web page classification 
using weighting schemes. 

Table 6.4 shows the classification results of the approaches considering
inverse tag weighting functions on classifier committees for Delicious. Even
though inverse tag weighting functions did not improve the performance of the
tag-based classifier, but rather reduced its overall accuracy, they seem to
provide better decisions to be combined with other data sources. The
predictions and margins output by all three approaches using inverse weighting
functions yield slightly better results on the classification, especially when
it comes to second level classification. Thereby, inverse tag weighting
functions are useful for Delicious when their outputs are applied on classifier
committees along with content and reviews. The unsuitability of tag weighting
functions on their own is compensated by the use of classifier committees.
However, the small improvement achieved by tag weighting functions when using
committees is almost negligible as compared to the TF-based committees. This
improvement is slightly clearer for the largest training sets in the second
level classification.

Table 6.5 on the next page shows the classification results of the approaches 
considering tag weighting functions on classifier committees for LibraryThing 
over DDC and LCC schemes. In this case, inverse tag weighting functions are also 
LibraryThing - DDC

Top level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .857   .866   .868   .872   .875   .876   .876
TF-IRF              .864   .882   .886   .890   .894   .897   .897
TF-IBF              .865   .883   .887   .891   .894   .897   .898
TF-IUF              .865   .883   .886   .889   .892   .894   .895

Second level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .687   .708   .717   .721   .729   .729   .733
TF-IRF              .709   .742   .754   .765   .770   .773   .778
TF-IBF              .710   .742   .756   .767   .772   .776   .780
TF-IUF              .712   .741   .752   .763   .767   .769   .775

LibraryThing - LCC

Top level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .831   .845   .853   .856   .861   .859   .864
TF-IRF              .849   .869   .876   .880   .887   .885   .890
TF-IBF              .851   .871   .879   .882   .888   .887   .892
TF-IUF              .852   .869   .875   .880   .886   .885   .888

Second level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .688   .723   .736   .746   .754   .755   .766
TF-IRF              .712   .750   .770   .782   .789   .793   .803
TF-IBF              .717   .755   .773   .786   .792   .797   .806
TF-IUF              .719   .754   .770   .781   .788   .792   .801

Table 6.5: Accuracy results of classifier committees for book classification using
weighting schemes (LibraryThing).

useful when applied to classifier committees, as compared to the TF-based one.
The classifier committees that include tag weighting functions produce clearly
better results than the committee using TF. The improvement holds for both
levels, but it is larger for the second level. However, the tag-based
representations with tag weighting functions perform better on their own,
without committees (see Table 6.2 on page 102). That is, it is better to use
the tag-based classifier alone, without including the predictions of the
classifier using reviews. Predictions with tag weighting functions are good
enough on their own, and it is better to ignore the other data source, i.e.,
reviews, which cannot match the performance of tags and harms the overall
performance.

Table 6.6 on the next page shows the classification results of the approaches 
considering tag weighting functions on classifier committees for GoodReads over 
DDC and LCC schemes. The main conclusions drawn from these results are very 
GoodReads - DDC

Top level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .820   .847   .857   .865   .867   .872   .874
TF-IRF              .835   .859   .867   .874   .877   .881   .884
TF-IBF              .837   .861   .868   .876   .878   .882   .885
TF-IUF              .834   .858   .866   .873   .876   .881   .883

Second level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .610   .651   .670   .683   .691   .696   .705
TF-IRF              .637   .676   .693   .707   .716   .719   .726
TF-IBF              .642   .681   .697   .711   .719   .723   .730
TF-IUF              .638   .677   .694   .708   .717   .722   .727

GoodReads - LCC

Top level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .831   .836   .847   .853   .857   .857   .864
TF-IRF              .826   .846   .856   .861   .866   .864   .870
TF-IBF              .829   .848   .858   .863   .868   .866   .871
TF-IUF              .826   .845   .856   .860   .866   .864   .869

Second level
Training set size   3000   6000   9000   12000  15000  18000  21000
TF                  .624   .674   .696   .712   .725   .730   .735
TF-IRF              .647   .697   .716   .732   .742   .748   .757
TF-IBF              .651   .700   .720   .736   .746   .751   .759
TF-IUF              .648   .698   .718   .733   .744   .749   .757

Table 6.6: Accuracy results of classifier committees for book classification using
weighting schemes (GoodReads).

similar to those on LibraryThing. Again, committees relying on tag weighting
functions perform clearly better than the one relying on TF, especially for the
second level. However, it is better to ignore the other data source, i.e.,
reviews, since the results of tags on their own are good enough and cannot be
improved by combining them (see Table 6.3 on page 103).

Again, using classifier committees obtains comparable results with very sim- 
ilar trends for both book taxonomies, DDC and LCC. 

Summarizing the results for all three datasets, tag weighting functions have
proven helpful in all cases, as compared to TF, when combined with other data
sources using classifier committees. However, it is better to rely only on the
tag-based classifier for both LibraryThing and GoodReads, where tags score good
results on their own and are harmed when combined with other data sources. In
the case of Delicious, on the other hand, classifier committees relying on tag
weighting functions perform better than tags on their own, and better than the
committees relying on TF. Nonetheless, the latter perform just slightly worse,
and the results are very similar, suggesting that either of them could be used
to perform the task.

6.4.3 Correlation between Tag Weighting Functions 

All three inverse tag weighting functions consider the distribution of tags
across different dimensions. The values given by these three functions may or
may not correlate depending on the behavior of users. For example, if many tags
annotated by a large number of users concentrate on the same resource, the
correlation between IUF and IRF would be lower than if each user applied those
tags to different resources. Thus, analyzing whether these three values
correlate is of utmost importance.

Table 6.7 shows correlation values among tag weighting schemes. The correlation
values between each pair of functions are shown in each row, for both the
Pearson (r) and Spearman (ρ) correlation coefficients. Note that the latter
considers the ranks inferred from tag weights, whereas the former considers the
values themselves to compute correlations. Both correlation values range from
-1 to 1. The closer this value is to 0, the weaker the correlation between the
compared sets and, thus, the more independent they are.





            Delicious       LibraryThing    GoodReads
            r      ρ        r      ρ        r      ρ
IRF-IUF     .763   .657     .679   .603     .529   .411
IRF-IBF     .991   .990     .989   .981     .997   .998
IUF-IBF     .780   .677     .720   .630     .556   .436

Table 6.7: Pearson (r) and Spearman (ρ) correlation coefficients.
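
To make the difference between the two coefficients concrete, the following minimal sketch computes both for two lists of tag weights using only the standard library. The function names and the toy weights are hypothetical, and the Spearman coefficient is obtained here as the Pearson coefficient of average ranks, which is one common way to handle ties.

```python
import math


def pearson(x, y):
    """Pearson correlation, computed on the values themselves."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks of the values."""
    return pearson(ranks(x), ranks(y))


# Hypothetical weights of four tags under two weighting functions.
irf = [1.0, 2.0, 3.0, 4.0]
iuf = [1.0, 4.0, 9.0, 16.0]
print(pearson(irf, iuf), spearman(irf, iuf))
```

In this toy case the two functions order the tags identically, so the rank-based coefficient is 1 even though the relation between the values is not linear and the Pearson coefficient is slightly lower.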

Correlation values show that there is a high dependence between IBF and IRF
values. Both seem to be almost fully dependent, which explains why these two
approaches achieve very similar results. The correlation decreases when IUF is
considered, so that it seems to be more independent of the rest. This
independence is clearest for GoodReads and intermediate for LibraryThing,
whereas Delicious shows the highest dependence of IUF on the other two values.
The main reason for the clear independence of IUF on GoodReads is that the
system suggests tags from users' own personomies, so that users easily spread
tags across bookmarks and resources while the user frequency remains unchanged.
As LibraryThing and Delicious do not have this feature, IUF correlates with the
others to a greater extent.
6.5 Conclusion

In this chapter, we have studied and analyzed the application of tag weighting
functions based on the classical IDF scheme for the resource classification task
on the three large-scale datasets introduced in Chapter 4 on page 59. We have
considered the distributions of tags across users, resources and bookmarks to
generate three variations of such weighting, namely IBF, IRF and IUF. We have
performed classification experiments by considering their results on their own,
and by combining them with other data sources using classifier committees. We
have analyzed their results by taking into account the settings of each social
tagging system, and how these affect the distribution of tags in the underlying
folksonomies.
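
The three variations can be sketched as follows, assuming the classical IDF form log(N / n_t), where N is the total number of users, resources or bookmarks and n_t is the number of them in which tag t occurs. The function name `inverse_weights`, the triple-based input format and the toy data are hypothetical; the exact formulas are defined earlier in the thesis and are not restated in this section.

```python
import math


def inverse_weights(assignments):
    """Compute IUF, IRF and IBF for each tag.

    assignments: iterable of (user, resource, tag) triples.
    A bookmark is taken to be a (user, resource) pair.
    """
    users, resources, bookmarks = set(), set(), set()
    tag_users, tag_resources, tag_bookmarks = {}, {}, {}
    for user, resource, tag in assignments:
        users.add(user)
        resources.add(resource)
        bookmarks.add((user, resource))
        tag_users.setdefault(tag, set()).add(user)
        tag_resources.setdefault(tag, set()).add(resource)
        tag_bookmarks.setdefault(tag, set()).add((user, resource))
    weights = {}
    for tag in tag_users:
        weights[tag] = {
            "IUF": math.log(len(users) / len(tag_users[tag])),
            "IRF": math.log(len(resources) / len(tag_resources[tag])),
            "IBF": math.log(len(bookmarks) / len(tag_bookmarks[tag])),
        }
    return weights


# Toy folksonomy: two users, three resources, four bookmarks.
data = [
    ("u1", "r1", "fiction"), ("u1", "r2", "fiction"),
    ("u2", "r1", "fiction"), ("u2", "r3", "python"),
]
w = inverse_weights(data)
print(w["fiction"], w["python"])
```

A tag used by every user, such as `fiction` here, gets an IUF of zero, while the rarer tag receives a higher weight on all three dimensions.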

With these experiments, we have given an answer to the following research 
questions: 

Research Question 7 

Can we further consider the distribution of tags across the collection so that we can 
measure the overall representativity of each tag to represent resources? 

We have analyzed the suitability of IDF-like weighting functions to define
the representativity of tags, which consider the distribution of tags throughout
the whole collection of resources. Our experiments have shown that these
functions help improve the performance of a resource classification task.
However, we have also shown that the settings of the social tagging system have
an effect on those distributions. Resource-based tag suggestions have been shown
to greatly influence the structure of folksonomies. Suggesting tags based on
previous annotations by others on the resource causes a very different tag
distribution, which in turn affects the results of the weighting function. When
a system enables resource-based tag suggestions, tag weighting functions perform
worse on their own, and combining them with other data sources is required to
improve performance; this combination can even outperform the TF-based approach.

For our classification experiments, we have found that IDF-like weighting 
functions clearly outperform the TF approach when resource-based tag sugges- 
tions are not enabled, i.e., on LibraryThing and GoodReads, both when used on 
their own, or when combined with other data sources. We found it better to 
consider just the tag-based approach, without combining them with other data 
sources, since it provides superior results, which cannot be improved by combin- 
ing them with other predictions. 

Research Question 8 

What is the best approach to weigh the representativity of tags in the collection for 
resource classification? 
Among the studied weighting functions, the one relying on bookmark frequencies
has proven to be the best when there are no resource-based tag suggestions. In
these cases, IBF performs best, followed by IRF and IUF. All of them clearly
outperform TF, both when used on their own and when combined with other data
sources using classifier committees.

On the other hand, when the social tagging system suggests tags to the user
relying on the resource itself, IUF performs better than IBF and IRF, because
of the importance of the ability of users to choose their own tags without
relying on suggestions from these systems. Even though IUF does not outperform
TF when used on its own, combining it with other data sources produces the best
approach. However, it is only slightly better than the committees relying on
TF, and either of them can be used to obtain similar results.



7 Analyzing the Behavior of Users for Classification

"Always imitate the behavior of the winners when you lose."
— George Meredith 

In this chapter, we explore the behavior of users on social tag- 
ging systems. Earlier works have suggested and shown that 
users of these systems follow different goals, and they tag re- 
sources for a certain purpose. Several classifications have been 
proposed to discriminate user behavior. Specifically, we consider 
one of those classifications of behavioral purposes. Such classi- 
fication splits user behavior into two goals: (a) users who aim at 
maintaining an organizational structure of the resources for later 
browsing, so-called Categorizers, and (b) users who rather pro- 
vide detailed descriptions for later search, so-called Describers. 
These two user behaviors yield different personomy structures, 
i.e., they follow a different tagging pattern, which produces dif- 
ferent tag selections from each other. 

Such a classification of users has been tested in previous work,
and has proven effective at identifying users who rather describe
resources, i.e., Describers. However, the appropriateness of
Categorizers for a resource classification task has not yet been
studied. We therefore study the suitability of users who fit such
behavior by performing a set of resource classification and
descriptiveness experiments. To this end, we split the whole set
of users into smaller subsets of extreme Categorizers and
Describers, and explore how well each subset of users fits the
classification or descriptiveness task.
This chapter is organized as follows. In Section 7.1 we briefly
summarize earlier research on users' motivation for tagging, and
motivate its interest for our work on resource classification. In
Section 7.2 we detail one of those classifications of user
behavior, which separates Categorizers from Describers. Then, in
Section 7.3 we present the settings of our resource classification
experiments enhanced by the detection of user behavior, and
present their results in Section 7.4. Finally, we conclude the
chapter in Section 7.5.

We address the following research questions in this chapter: 

Research Question 9 

Can we discriminate different user profiles so that we can find a 
subset of users who provide annotations that better fit a classifi- 
cation scheme? 

Research Question 10 

What are the features that identify a user as a good contributor 
to the resource classification? 

7.1 User Behavior on Social Tagging Systems 

It has been suggested that not all the users contributing on social tagging sys- 
tems are motivated by the same goal for annotating resources. Depending on 
their annotations, several works propose different classifications of user behavior 
(Korner et al., 2010b). Some of them focus on detecting the types of tags pro- 
vided by users. For instance, early works such as Golder and Huberman (2006) 
and Sen et al. (2006) propose the existence of several tag types. On the other 
hand, others have suggested discriminating user behavior by their annotations. 
In this regard, works such as Marlow et al. (2006b), Heckner et al. (2009), Nov 
et al. (2009) and Strohmaier et al. (2010a) propose differentiating users by their 
motivation for tagging resources. 

As a classification of user behavior that matches our requirements, we focus
on the latter, by Strohmaier et al. (2010a). In this work, the authors propose
differentiating two kinds of user behavior: Categorizers, who rather organize
resources, and Describers, who rather define the contents of resources. It
seems reasonable that Categorizers may provide annotations that fit the
resource classification task better than those of Describers. Next, we
describe these two types of users in more depth.
                            Categorizer       Describer
Goal of Tagging             later browsing    later retrieval
Change of Tag Vocabulary    costly            cheap
Size of Tag Vocabulary      limited           open
Tags                        subjective        objective

Table 7.1: Characteristics of Categorizers and Describers.

7.2 Categorizers vs Describers 

The approach we consider for discriminating users by their behavior was
introduced and evaluated in earlier works (Korner et al., 2010a,b; Korner,
2009). They consider the existence of two major tagging motivations on social
tagging systems: Categorizers and Describers.

Early works such as Marlow et al. (2006b), Hammond et al. (2005a) and Heckner
et al. (2009) suggest that a distinction between at least two types of user
motivations for tagging is of interest. On the one hand, users can be motivated
by categorization (in the following, Categorizers). These users view tagging as
a means to categorize resources according to some (shared or personal)
high-level conceptualizations. They typically use a rather elaborate tag set to
construct and maintain a navigational aid to the resources for later browsing.
On the other hand, users who are motivated by description (so-called
Describers) view tagging as a means to accurately and precisely detail
resources. These users tag because they want to produce annotations that are
useful for later search and retrieval. Developing a personal, consistent
ontology to navigate across their resources is not their intention. Table 7.1
gives an overview of the characteristics of the two types of users, based on
Korner (2009).

7.2.1 Measures 

We use three different measures to differentiate users into Categorizers and
Describers: Tags Per Post (TPP), Tag Resource Ratio (TRR), and Orphan Ratio
(ORPHAN). Additional measures are presented in Korner et al. (2010b), but due
to their high correlation with the others, we limited our efforts to the ones
above. These measures rely on two features of user behavior: verbosity, which
measures the number of tags a user tends to use when annotating, and diversity,
which measures the extent to which users apply new tags that they have not used
before. It is worth noting that each measure provides one value per user,
computed from the characteristics of the user's bookmarks and the attached tag
assignments. The resulting values are then ranked in a list along with those of
the rest of the users. This list makes it possible to infer the extent to which
a user is rather a Categorizer or a Describer.

7.2.1.1 Tags per Post (TPP) 

As a Describer would focus on describing their resources in a very detailed
manner, the number of tags used to annotate each resource can serve as an
indicator of the motivation of the analyzed user. The tags per post measure
(TPP for short) captures this by dividing the number of all tag assignments of
a user by the number of resources (see Equation 7.1). $|T_{ur}|$ is the number
of tags annotated by a user $u$ on a resource $r$, and $|R_u|$ is the number of
resources of the user. The more tags a user utilizes to annotate the resources,
the more likely they are to be a Describer, which is reflected in a higher TPP
score.

\[ \mathrm{TPP}(u) = \frac{\sum_{r} |T_{ur}|}{|R_u|} \tag{7.1} \]

This measure relies on the verbosity of users, as it computes the average 
number of tags they assigned to bookmarks. 
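
A minimal sketch of this measure for a single user follows. The function name `tpp` and the representation of a user's bookmarks as a dictionary from resource to tag set are assumptions made for illustration.

```python
def tpp(bookmarks):
    """Tags per post: all tag assignments divided by the number of resources.

    bookmarks: dict mapping each resource of one user to its set of tags.
    """
    total_assignments = sum(len(tags) for tags in bookmarks.values())
    return total_assignments / len(bookmarks)


# Two hypothetical users with two bookmarks each.
verbose_user = {
    "r1": {"python", "tutorial", "code"},
    "r2": {"news", "politics", "europe"},
}
terse_user = {"r1": {"dev"}, "r2": {"news"}}

print(tpp(verbose_user), tpp(terse_user))
```

The verbose user scores higher, so under this measure they would rank closer to the Describer end of the list.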

7.2.1.2 Orphan Ratio (ORPHAN) 

Since Describers do not have a fixed vocabulary and freely choose tags to
describe their resources in a detailed manner, they would not focus on reusing
tags. This factor is analyzed by the orphan ratio (ORPHAN for short), which
relates the number of seldom used tags to the total number of tags. Equation
7.2 shows how seldom used tags are defined by the individual tagging style of a
user. In this equation, $t_{max}$ denotes the most frequent tag of the user and
$R(t)$ the set of resources annotated with tag $t$. Equation 7.3 shows the
calculation of the final measure, where $T_u^o$ are the seldom used tags and
$T_u$ are all tags of the given user. Users with more seldom used tags yield a
higher orphan ratio, and they are more likely to be Describers.

\[ n = \left\lceil \frac{|R(t_{max})|}{100} \right\rceil \tag{7.2} \]

\[ \mathrm{ORPHAN}(u) = \frac{|T_u^o|}{|T_u|}, \quad T_u^o = \{ t \mid |R(t)| \leq n \} \tag{7.3} \]

By measuring whether users frequently use the same tags or rather rely on 
new ones, the ORPHAN ratio considers their diversity. 
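
The sketch below illustrates the orphan ratio for one user. It assumes that the threshold n is the usage count of the most frequent tag divided by 100 and rounded up, and that a tag counts as seldom used when it appears on at most n resources; both readings, as well as the function name and the input format, are assumptions for illustration.

```python
import math


def orphan_ratio(bookmarks):
    """Share of a user's tags that are seldom used.

    bookmarks: dict mapping each resource of one user to its set of tags.
    """
    usage = {}
    for tags in bookmarks.values():
        for tag in tags:
            usage[tag] = usage.get(tag, 0) + 1
    # Threshold derived from the user's most frequent tag (assumed ceiling).
    n = math.ceil(max(usage.values()) / 100)
    orphans = [tag for tag, count in usage.items() if count <= n]
    return len(orphans) / len(usage)


# Hypothetical user: "a" is reused, "b" and "c" are used only once.
user = {"r1": {"a", "b"}, "r2": {"a", "c"}, "r3": {"a"}}
print(orphan_ratio(user))
```

For this user two of the three tags are orphans, so the ratio is two thirds; a user who always reuses the same tags would score 0.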



7.2.1.3 Tag Resource Ratio (TRR) 



The tag resource ratio (TRR for short) relates the number of tags of a user
(i.e., the size of their vocabulary) to the total number of annotated resources
(see Equation 7.4). A typical Categorizer would use a small number of tags as
compared to the number of resources and would therefore score a low TRR value.

\[ \mathrm{TRR}(u) = \frac{|T_u|}{|R_u|} \tag{7.4} \]

This measure relies on both verbosity, because users who use more tags in 
each bookmark would usually result in a higher TRR value, and diversity, as 
those who frequently use new tags will have a larger vocabulary. Nonetheless, 
the latter has a higher impact in this case, since the former could be altered by 
verbose users who tend to reuse tags. 
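
A short sketch of the tag resource ratio follows, again with a hypothetical function name and a dictionary from resource to tag set as input. The third toy user illustrates the last point above: a verbose user who reuses the same tags keeps a moderate TRR.

```python
def trr(bookmarks):
    """Tag resource ratio: vocabulary size divided by number of resources.

    bookmarks: dict mapping each resource of one user to its set of tags.
    """
    vocabulary = set().union(*bookmarks.values())
    return len(vocabulary) / len(bookmarks)


# Hypothetical users.
small_vocabulary = {"r1": {"dev"}, "r2": {"dev"}, "r3": {"news"}, "r4": {"news"}}
large_vocabulary = {"r1": {"a", "b"}, "r2": {"c", "d"}}
verbose_reuser = {"r1": {"a", "b", "c"}, "r2": {"a", "b", "c"}}

print(trr(small_vocabulary), trr(large_vocabulary), trr(verbose_reuser))
```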



7.3 Calculation of Measures and Experiment Settings 

Users of each social tagging site have their own weights for each of the three 
measures above. Thus, we computed TPP, ORPHAN and TRR values for each 
user. This way, we are able to generate three ranked lists of users for each site. 
In these rankings, Categorizers rank high, whereas Describers rank low (this is 
arbitrary and could be inverted as well). From these lists, we can select a subset 
of users in the top as Categorizers, and another subset in the tail as Describers. 
Both sets should have the same size in order to compare them. 

Our main goal is to conclude whether these measures can discriminate Cat- 
egorizers in such a way that they perform better than Describers on a resource 
classification task. However, we also perform experiments measuring the descrip- 
tiveness of users' tags in order to conclude whether Describers perform better to 
that end. With the subsets of Categorizers and Describers defined above, we per- 
form classification and descriptiveness experiments to know how suitable they 
are for each of the tasks. 

Table 7.2 on the following page shows the distribution of the three measures
we calculated for users on the three datasets. The X axis represents quantiles
of values, whereas the Y axis represents the number of users belonging to each
quantile. Note that the values themselves are not relevant; they only allow us
to rank each user and analyze where they fall in the distribution of all
weights. On the one hand, the TRR measure follows a similar distribution for
all three datasets. On the other hand, for the other two measures, the
distributions show that there are many extreme users on LibraryThing and
GoodReads: (1) the ORPHAN measure shows both many extreme Categorizers and many
extreme Describers, and (2) the TPP measure discriminates a large set of
extreme Categorizers, but almost no extreme Describers. These distributions
change drastically on Delicious, though. For this dataset, there are many users
who have middle values, and who are not so clearly discriminated as Describers
or Categorizers.

Table 7.2: Distribution histograms of the three measures (TRR, ORPHAN and
TPP) for the three datasets. The X axis represents the quantiles of values,
whereas the Y axis represents the number of users in each quantile.

To choose the sets of users with which we perform the experiments, we split
the ranked lists by taking some of the top and bottom users. Choosing fixed
percentages of users would be unfair, though. Some users are likely to be more
verbose, by definition of some measures, and they usually provide many more
tag assignments than others. Thus, we split the users according to the
percentage of tag assignments they provide¹. This enables a fairer split of the
users, with the same amount of data: e.g., a 10% split ensures that both sets
include 10% of all tag assignments, although the number of users differs
between them. Figure 7.1 shows an example of how splitting by number of tag
assignments can differ from splitting by number of users. We split the user
sets into smaller subsets of users ranging from 10% to 100%, with a step size
of 10%.

[Figure: six users annotated with 5, 3, 2, 2, 2 and 2 tags; labels: "50% split
according to tag assignments" and "50% split according to number of users".]

Figure 7.1: Example of a 50% segment, selected based on tag assignments or
number of users. Splitting by number of users would be unfair, since it may yield
bigger amounts of data.
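
The difference between the two splitting strategies can be sketched as follows, using the six tag counts that label Figure 7.1. The function names are hypothetical, and the stopping rule (add users from the top of the ranking until the requested share of tag assignments is reached) is an assumption, since the text does not state how a user who straddles the boundary is handled.

```python
def split_by_assignments(ranked_users, fraction):
    """Take users from the top until `fraction` of all tag assignments is covered.

    ranked_users: list of (user, number_of_tag_assignments), already ranked.
    """
    total = sum(count for _, count in ranked_users)
    target = fraction * total
    selected, covered = [], 0
    for user, count in ranked_users:
        if covered >= target:
            break
        selected.append(user)
        covered += count
    return selected


def split_by_users(ranked_users, fraction):
    """Take a fixed fraction of the users, ignoring how much data they provide."""
    cutoff = round(fraction * len(ranked_users))
    return [user for user, _ in ranked_users[:cutoff]]


# Six ranked users with 5, 3, 2, 2, 2 and 2 tag assignments (16 in total).
users = [("u1", 5), ("u2", 3), ("u3", 2), ("u4", 2), ("u5", 2), ("u6", 2)]

print(split_by_assignments(users, 0.5))  # covers 8 of the 16 assignments
print(split_by_users(users, 0.5))        # covers 10 of the 16 assignments
```

To select the subset at the other end of the ranking, the same functions can be applied to the reversed list.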

7.3.1 Tag-based classification 

For the tag-based classification, we represent the resources by aggregating an- 
notations provided by the users within the considered subset of Categorizers or 
Describers. This creates reduced tagging data for each resource. With these re- 
duced representations, we feed the multiclass SVM classifier defined in Chapter 3 
on page 47, and calculate their performance by measuring the accuracy of their 
predictions. This enables comparing same percents of tag assignments by Cate- 
gorizers and Describers, in order to analyze whether the former outperform the 
latter. 
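
As an illustration of how the reduced tagging data can be built, the sketch below aggregates only the annotations of a selected subset of users into one tag-frequency vector per resource. The function name, the triple-based input and the toy data are hypothetical, and the classifier itself is not shown.

```python
from collections import Counter, defaultdict


def resource_vectors(assignments, selected_users):
    """Aggregate the tags of the selected users into per-resource tag counts.

    assignments: iterable of (user, resource, tag) triples.
    selected_users: set of users in the considered subset.
    """
    vectors = defaultdict(Counter)
    for user, resource, tag in assignments:
        if user in selected_users:
            vectors[resource][tag] += 1
    return dict(vectors)


# Toy data: u3 is outside the selected subset, so their tags are ignored.
data = [
    ("u1", "r1", "fiction"), ("u2", "r1", "fiction"),
    ("u2", "r2", "history"), ("u3", "r1", "to-read"),
]
reduced = resource_vectors(data, {"u1", "u2"})
print(reduced)
```

The resulting counts play the role of the tag-based representation that is then passed to the classifier for each subset of users.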

7.3.2 Descriptiveness of Tags 

To compute the extent to which a subset of users provides descriptive tags,
we compare their tags to the descriptive data of the resources. These
descriptive data include:

• The textual content of the web pages, as well as user reviews, for the
Delicious dataset.

• Synopses, user reviews and editorial reviews for the book datasets, i.e.,
LibraryThing and GoodReads.

¹ We define each of the tags annotated in a bookmark as a tag assignment. Thus, a
bookmark has as many tag assignments as the user has annotated tags on it.

In the first step, we merge all these data into a single text for each
resource, comprising all its descriptive data. After this, we compute the
frequency of each term (TF) in the texts, so that we can create a vector for
each resource, where each dimension of the vector corresponds to a term. On the
other hand, for each selection of users, we create the tag vector of each
resource from the annotations of those users. This way, we have the reference
descriptive vectors as well as the tag vectors we want to compare to them.

There are several measures that could compute the similarity between a tag
vector (T) and a reference vector (R) for a given resource r. They tend to be
correlated, though. Regardless of the absolute values given by a measure, we
are interested in comparable values that allow us to determine whether one tag
set resembles the reference data to a greater or lesser extent than another.
Thus, as a well-known and robust measure for this, we compute the cosine
similarity between the vectors (see Equation 7.5).



\[ \mathrm{similarity}_r = \cos(\theta_r) = \frac{\sum_{i} T_{ri} \times R_{ri}}{\sqrt{\sum_{i} T_{ri}^2} \times \sqrt{\sum_{i} R_{ri}^2}} \tag{7.5} \]



The above formula provides the similarity between the tag vector and the
reference vector of a single resource. This value is the cosine of the angle
between the two vectors, which ranges from 0 to 1, since the term frequencies
are never negative. A value of 1 would mean that both vectors point in exactly
the same direction, whereas a value of 0 would mean that they share none of the
terms, and so they are completely different. After getting the similarity value
for each pair of vectors, we need the overall similarity between users' tags
and the descriptive data of resources. Accordingly, the similarity between the
set of n reference vectors and the set of n tag vectors is computed as the
average of the similarities between pairs of tag and reference vectors (see
Equation 7.6).

\[ \mathrm{similarity} = \frac{1}{n} \sum_{r=1}^{n} \cos(\theta_r) \tag{7.6} \]
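
The per-resource cosine and its average over resources can be sketched as follows, with sparse vectors stored as dictionaries from term to frequency. The function names and toy vectors are hypothetical, and treating a resource without tags from the selected users as having similarity 0 is an assumption of this sketch.

```python
import math


def cosine(tag_vec, ref_vec):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * ref_vec.get(term, 0) for term, weight in tag_vec.items())
    norm_tags = math.sqrt(sum(w * w for w in tag_vec.values()))
    norm_ref = math.sqrt(sum(w * w for w in ref_vec.values()))
    if norm_tags == 0 or norm_ref == 0:
        return 0.0
    return dot / (norm_tags * norm_ref)


def descriptiveness(tag_vectors, ref_vectors):
    """Average cosine over all resources that have a reference vector."""
    similarities = [
        cosine(tag_vectors.get(resource, {}), ref_vec)
        for resource, ref_vec in ref_vectors.items()
    ]
    return sum(similarities) / len(similarities)


# r1: tags proportional to the reference text; r2: no shared terms.
tag_vectors = {"r1": {"python": 2, "code": 1}, "r2": {"news": 1}}
ref_vectors = {"r1": {"python": 4, "code": 2}, "r2": {"sports": 3}}

print(descriptiveness(tag_vectors, ref_vectors))
```

Because the cosine depends only on direction, the tag vector of r1 matches its reference vector perfectly even though the raw frequencies differ, while r2 contributes a similarity of 0.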

This similarity value shows the extent to which the tags provided by the
selected set of users resemble the reference descriptive data, i.e., how
descriptive the tags by those users are. The higher the similarity value, the
more descriptive the tags provided by the users; the closer it is to 0, the
less descriptive they are. This makes it possible to compare the same
percentages of tag assignments by Categorizers and Describers, and to analyze
which of them provide more descriptive tags.

7.4 Results 

Table 7.3 shows the performance of Categorizers (continuous line) and
Describers (dashed line) on the classification task, whereas Table 7.4 on page
121 does the same for the descriptiveness experiments. The results are
presented in different graphs organized by dataset in rows (Delicious,
LibraryThing and GoodReads) and by measure in columns (TRR, ORPHAN and TPP).
All of them keep the same scale and range for the X axis, as well as for the Y
axis within each dataset, which enables an easy visual comparison of the
results. When analyzing these results, we are especially interested in
performance differences between Categorizers and Describers, and in studying
whether and why such subsets of users perform better for a certain task. By
construction, both Categorizers and Describers yield the same performance for
the 100% sets, as these comprise the whole set of users.

7.4.1 Categorizers Perform Better on Classification 

It stands out that all three measures get positive results for both classification 
and descriptiveness experiments on LibraryThing. The subsets of Categorizers 
perform better for classification in all cases for this dataset. This means that 
all three measures provide a good way to discriminate Categorizers. Among the 
compared measures, TPP gets the largest gap for classification, whereas TRR does 
it for descriptiveness. 

As regards GoodReads, results are less consistent. TPP yields especially
positive results on this dataset. With the other measures, TRR and ORPHAN,
Describers outperform Categorizers for classification. However, TRR works well
for the 10% subsets, suggesting that it correctly discriminates a subset of
extreme Categorizers, but fails as the subsets grow. We speculate that the
reason for this observation is that this social tagging system suggests tags to
users from their personomy. This encourages users to keep a smaller vocabulary
and to reuse their tags frequently, since it is much easier to click on a list
of tags than to type them.

In the case of Delicious, it seems that the resource-based tag suggestions of
the system make it more difficult to detect Categorizers. On the one hand, TRR
and ORPHAN show very slight differences between Categorizers and Describers, so
that the discrimination does not seem to be performed appropriately. On the
other hand, TPP works well for small subsets of users, where Categorizers
outscore Describers. This advantage is reversed for larger subsets of users,
though.

Table 7.3: Tag-based classification accuracy results for Categorizers
(continuous lines) and Describers (dashed lines) on Delicious, LibraryThing and
GoodReads. The X axis represents the percentages of selected top users, ranging
from 10% to 100% with a step size of 10%, either for Categorizers or
Describers, whereas the Y axis represents the accuracy.

Table 7.4: Similarity measures of the descriptiveness of tags on Delicious,
LibraryThing and GoodReads. Continuous lines correspond to Categorizers,
whereas dashed lines correspond to Describers. The X axis represents the
percentages of selected top users, ranging from 10% to 100% with a step size of
10%, either for Categorizers or Describers, whereas the Y axis represents the
degree of similarity (i.e., cosine value) to descriptive data.

Analyzing the three datasets altogether, TPP shows the best way to discrimi- 
nate Categorizers as better contributors to the resource classification task. This is 
clear for GoodReads and LibraryThing, but it only happens for small subsets of 
users on Delicious. 



7.4.2 Describers Perform Better on Descriptiveness 

The results of the descriptiveness experiments show that Describers are always
superior to Categorizers in this regard. All three measures prove useful for
discriminating Describers among users on social tagging systems, regardless of
the settings of the site. Moreover, the measures behave similarly on all sites,
insofar as the advantage of Describers over Categorizers is fairly similar on
the three datasets. However, TRR stands out with the largest gap between the two
groups, and therefore provides the best detection of Describers.
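
As an illustration only, the degree of similarity used in the descriptiveness experiments, i.e., a cosine value between a set of tags and the descriptive data of a resource, can be sketched as follows. The function name, the toy tags and the toy description are assumptions made for this example; they are not taken from the datasets or from the actual implementation.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy data for a single book.
describer_tags = Counter(["vampire", "romance", "teen", "forks"])
categorizer_tags = Counter(["fiction"])
descriptive_data = Counter(
    "a teen romance between a girl and a vampire in forks".split()
)

sim_describer = cosine_similarity(describer_tags, descriptive_data)
sim_categorizer = cosine_similarity(categorizer_tags, descriptive_data)
```

In this toy case the descriptive tags overlap with the description while the shelf-like tag does not, so the first similarity is higher than the second.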

7.4.3 Verbosity vs Diversity 

The three measures we have studied in this work rely on two different features
to discriminate user behavior: verbosity and diversity. Given the better overall
performance of the TPP measure for resource classification, and of the TRR
measure for the descriptiveness task, we believe that (1) verbosity is the
optimal feature for discriminating Categorizers, and (2) diversity is the feature
that better discriminates Describers. In this context, we believe that
Categorizers think of a physical organization of resources when they annotate
them with tags, as librarians would do by placing books on shelves. For instance,
in the specific case of books, a user who thinks of the shelf where they keep
their fiction books seems very likely to use only the tag fiction. We could
define these shelf-driven users as non-verbose. A user who adds just one tag
has probably thought of the single tag that places the book on the corresponding
shelf. On the other hand, users who provide more detailed and diverse annotations
think of describing the book rather than placing it on a specific shelf. This
aspect makes the verbosity feature more powerful than the diversity feature for
the detection of Categorizers. Thus, we believe that TPP is so useful at
discriminating Categorizers in search of an accurate resource classification, as
compared to TRR and ORPHAN, because it relies only on users' verbosity.
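
The following minimal sketch shows how the three measures could be computed for a single user. It assumes the commonly cited definitions: TPP as the average number of tags per post, TRR as the size of the user's vocabulary divided by the number of annotated resources, and ORPHAN as the share of the user's tags applied to at most ceil(max_usage / 100) resources. These definitions, the function name and the toy users are assumptions made for illustration and should be checked against Strohmaier et al. (2010a).

```python
import math
from collections import Counter


def user_measures(posts):
    """Return (TPP, TRR, ORPHAN) for one user.

    posts: list of tag lists, one list per resource bookmarked by the user.
    The formulas are assumed definitions, not a reference implementation.
    """
    n_posts = len(posts)
    if n_posts == 0:
        raise ValueError("user has no posts")
    # Number of resources to which the user applied each tag.
    usage = Counter(tag for tags in posts for tag in set(tags))
    if not usage:
        return 0.0, 0.0, 0.0
    tpp = sum(usage.values()) / n_posts  # verbosity: tags per post
    trr = len(usage) / n_posts           # diversity: vocabulary per resource
    threshold = math.ceil(max(usage.values()) / 100)
    orphans = sum(1 for count in usage.values() if count <= threshold)
    orphan_ratio = orphans / len(usage)  # diversity: rarely reused tags
    return tpp, trr, orphan_ratio


# Hypothetical users: one shelf-driven, one describing each book in detail.
shelf_user = [["fiction"], ["fiction"], ["history"], ["fiction"]]
detail_user = [
    ["vampire", "romance", "teen"],
    ["war", "russia", "19th-century", "classic"],
    ["dragons", "fantasy", "epic"],
    ["physics", "popular-science"],
]

shelf_scores = user_measures(shelf_user)
detail_scores = user_measures(detail_user)
```

Under these assumptions the shelf-driven user scores lower on all three measures, which is the behavior expected of a Categorizer; ranking users by a measure and taking the top or bottom percentage yields the subsets used in the experiments.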

7.4.4 Non-descriptive Tags Provide More Accurate Classification

When discriminating user behavior appropriately by using a verbosity-based 
measure like TPP, we have shown that Categorizers better fit the classification 
task, whereas Describers provide annotations that further resemble the descrip- 
tive data. An interesting deduction from here is that a set of annotations that 
differs to a greater extent from the descriptive data produces a more accurate 
classification of the books. From this, we infer that Describers are using more de- 
scriptive tags, whereas Categorizers rather use non-descriptive tags. Hence, users 
who do not think of providing annotations in a similar way to writing reviews 
rely on non-descriptive tags, yielding a more accurate classification of the books. 

7.5 Conclusion 

In this chapter, we have explored the detection of user behavior on social
tagging systems in search of users whose annotations come closer to the resource
classification task. To this end, we have explored the measures presented by
Strohmaier et al. (2010a), which help us determine whether a user is a
Categorizer, who rather organizes resources, or a Describer, who rather details
the content of the resources. Specifically, we have studied the application of
three different measures (TRR, ORPHAN and TPP), which rely on two main features:
verbosity and diversity of user annotations. By choosing different subsets of
Categorizers and Describers, we have performed experiments on (1) resource
classification, in order to explore whether the tags by Categorizers more closely
resemble the classification by experts, and (2) measurement of the
descriptiveness of tags, in order to explore whether there is a higher similarity
between tags by Describers and descriptive data of the resources.

Besides improving our understanding of users who aim at classification on social
tagging systems, i.e., Categorizers, we have complemented previous work by
Korner et al. (2010a), where the authors showed that Describers are a good
source for inferring semantic relations from folksonomies.

Parts of the research in this chapter have been published in Zubiaga et al. 
(2011b). 

We have answered the following research questions: 

Research Question 9 

Can we discriminate different user profiles so that we can find a subset of users 
who provide annotations that better fit a classification scheme? 

We have shown that such a type of user, the so-called Categorizer, actually
exists. Tags assigned by Categorizers provide a more accurate classification of
resources than those assigned by another set of users, the so-called Describers.
According to our experiments, this is mostly true for systems without tag
suggestions, i.e., LibraryThing, where the resource classification performed
with tags by Categorizers yields clearly better results. When such suggestions
exist, the detection of suitable users becomes more difficult, as we have shown
for GoodReads and Delicious. However, applying an appropriate measure that
considers suitable features can produce a successful selection of users who fit
the characteristics of a Categorizer.

Research Question 10 

What are the features that identify a user as a good contributor to the resource 
classification?

We have analyzed two features that characterize users of social tagging sys- 
tems: verbosity, and diversity. We have shown that the level of verbosity helps 
discover Categorizers, who are better suited for the classification task. The vocab- 
ulary diversity is useful to find Describers, who tend to annotate using descriptive 
tags. Moreover, we have shown that users who do not rely on descriptive data 
provide better classification metadata than those who use descriptive tags. 

"Believe those who are seeking the truth. Doubt those who find it."
— Andre Gide 

We conclude the thesis in this chapter. First, we summarize the
main contributions to the research field in Section 8.1. We
continue by answering the formulated research questions in
Section 8.2. Finally, in Section 8.3 we present an outlook on
future directions of the research work in this thesis.

8.1 Summary of Contributions 

The novel idea of this work lies in the use of social annotations for carrying out 
a resource classification task. To the best of our knowledge, the first research 
work performing real classification experiments using social annotations is our 
first work in the field (Zubiaga et al., 2009d). Prior to that, only Noll and Meinel 
(2008a) had performed a statistical analysis comparing social tags to a classifica- 
tion performed by experts. Taking into account the lack of work in the field, the 
work comprised in this thesis sheds new light on the appropriate use and repre- 
sentation of social tags for resource classification. More specifically, the following 
are the main contributions of this work: 

• We have created three large-scale social tagging datasets, including
classification metadata of the annotated resources. These are among the largest
datasets used so far for research and, to the best of our knowledge, the
largest used for resource classification experiments. Some of these datasets,
along with other smaller datasets we created, have been made publicly
available for research purposes (http://nlp.uned.es/social-tagging/datasets/).
Godoy and Amandi (2010) and Strohmaier et al. (2010b), for instance, have used
some of our datasets in their recent research works. Even after we created
these social tagging datasets and made parts of them publicly available, little
work has been done on creating and releasing more datasets. Korner and
Strohmaier (2010) present a list of publicly available social tagging datasets,
which includes ours. However, the authors point out the lack of available
datasets, and encourage researchers to create and release new ones. As far as
we know, no additional datasets including categorization data for tagged
resources have been released since.

• Our work is the first to compare different representations of resources based
on social tags for resource classification. Moreover, it is the first work
performing actual classification experiments comparing social tags to other
data sources. We have shown that social tags are also useful for classification
into narrower categories in deeper levels of taxonomies. In a previous work,
Noll and Meinel (2008a) performed a statistical study concluding that social
tags may not be helpful for narrower categories. In contrast, our classification
experiments show a larger improvement over other data sources for narrow
categorization.

• We have analyzed the distributions of social tags in folksonomies, and per- 
formed a thorough study on how the settings of each social tagging system 
affect them, and therefore, a resource classification task. In this regard, we 
have applied a consolidated weighting scheme, TF-IDF, to the new social 
data structure given by folksonomies. 

• We have shown the existence of a group of users, so-called Categorizers, 
whose annotations more closely resemble the classification performed by 
experts than social tags provided by another group of users known as De- 
scribers. The approach of differentiating Categorizers from Describers was 
already tested and verified in earlier works by proving the suitability of the 
latter for inferring semantic relations from folksonomies. Going further, we 
have demonstrated the suitability of Categorizers for resource classification. 

The use of social annotations for resource classification tasks was a novel
research line at the beginning of this thesis. However, the increasing interest
of researchers in user-generated content in social media, and specifically in
social tagging systems, has recently brought about more work in the field.
Along with this increase, more researchers have shown interest in the use of
social annotations for resource classification tasks, and the number of works
in this field has grown. Godoy and Amandi (2010), for instance, perform a
tag-based classification study inspired by our earlier work (Zubiaga et al.,
2009d). Furthermore, Aliakbary et al. (2009), Yin et al. (2009), Xia et al.
(2010), and Lu et al. (2010) have recently presented research on related
matters, making use of social tags for resource classification.

8.2 Answers to Research Questions 

At the beginning of this work, we set forth the following problem statement 
summarizing the main goal of the thesis: 

Problem Statement 

How can the annotations provided by users on social tagging systems be exploited 
to yield the most accurate resource classification task? 

In order to address this problem statement, we split it into 10 research
questions. Next, we list those research questions along with our answers to them:

Research Question 1 

What kind of SVM classifiers should be used to perform this kind of classification 
tasks: a native multiclass classifier, or a combination of binary classifiers? 

We have shown the clear superiority of the native multiclass SVM classifiers 
over the other approaches combining binary classifiers. Our results show that 
relying on a set of binary classifiers is not a good option when it comes to mul- 
ticlass taxonomies. Accordingly, native multiclass classifiers, which consider all 
the classes at the same time and have more knowledge of the whole task, perform 
much better. 

Research Question 2 

What kind of learning method performs better for this kind of classification tasks: 
a supervised one, or a semi-supervised one? 

Semi-supervised approaches may perform better when the labeled subset is
really small, but supervised approaches, which are computationally less
expensive, perform similarly with more labeled documents. Therefore, we have
shown that, unlike in the binary tasks studied by Joachims (1999), a supervised
approach performs very similarly to a semi-supervised approach in these
environments. It seems reasonable that predicting the class of uncategorized
documents is much more difficult when the number of classes increases, so that
the miscategorized documents are harmful to the classifier's learning.

Therefore, according to the two conclusions above, we decided to use a
supervised multiclass SVM approach.

Research Question 3 

How do the settings of social tagging systems affect users' annotations and the 
resulting folksonomies?

To this end, we have analyzed several features that can be found in different
settings of social tagging systems. Among the analyzed features, we have shown
the impact of tag suggestions, which considerably alter the resulting folksonomy.
The studied social tagging sites all differ in their settings regarding
suggestions:

• Resource-based suggestions (Delicious): when the system suggests tags
assigned by other users to the resource at the time of bookmarking it, the
likelihood of using new tags to further describe such a resource decreases.
In this case, users are less original and tend to rely on system suggestions.

• Personomy-based suggestions (GoodReads): when the system suggests 
tags previously used by the user, the vocabulary in their personomy tends 
to be much smaller. However, users do not know how others annotated a 
resource, and thus they are likely to provide new tags to the resource. 

• Without suggestions (LibraryThing): when the system does not suggest 
any tags to the user, the vocabulary in their personomy increases, as well 
as the diversity of tags in each resource. 

Research Question 4 

What is the best way of amalgamating users' aggregated annotations on a resource 
in order to get a single representation for a resource classification task? 

We have shown that it is worthwhile to consider all the tags annotated on
a resource instead of only the top tags, i.e., those annotated most often. Tags
in the top are the most important, and give the main information on the aboutness
of resources. However, tags in the tail are also helpful, though to a lesser
extent, providing meaningful information and improving the performance of the
classifier.

Regarding the weights assigned to those tags when representing a resource, 
the number of users annotating each tag should be considered in order to get the 
best results. This is the value that has shown the best results in our experiments. 
It has outperformed other approaches ignoring weights or considering other data 
such as the total number of users annotating the resource. 

Thereby, the best representation in our experiments is the one that includes all 
the tags with the values corresponding to the number of users annotating them. 
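
As a minimal sketch of this representation, the snippet below builds the vector of a resource from its bookmarks, weighting each tag by the number of distinct users who annotated it. The function name, the optional top_k parameter and the toy bookmarks are hypothetical and only serve to illustrate the idea.

```python
from collections import Counter


def resource_vector(bookmarks, top_k=None):
    """Represent a resource as {tag: number of users who annotated it}.

    bookmarks: iterable of (user, tags) pairs for a single resource.
    top_k: if given, keep only the top_k most annotated tags (illustrative).
    """
    users_per_tag = {}
    for user, tags in bookmarks:
        for tag in tags:
            users_per_tag.setdefault(tag, set()).add(user)
    weights = Counter({tag: len(users) for tag, users in users_per_tag.items()})
    if top_k is None:
        return dict(weights)  # all tags, including those in the tail
    return dict(weights.most_common(top_k))


# Hypothetical bookmarks of one web page by three users.
toy_bookmarks = [
    ("u1", ["python", "programming"]),
    ("u2", ["python", "tutorial"]),
    ("u3", ["python", "programming", "reference"]),
]

full_vector = resource_vector(toy_bookmarks)
top_vector = resource_vector(toy_bookmarks, top_k=2)
```

The full vector keeps the tail tags (tutorial, reference) with low weights, whereas the top-2 variant discards them; the former corresponds to the representation that performed best in our experiments.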

Research Question 5 

Despite the usefulness of social tags for these tasks, is it worthwhile to
consider their combination with other data sources, like the content of the
resource, as an approach to improve the results even more?

By means of classifier committees, which combine the predictions by different
classifiers, we have shown that tags provide reliable prediction criteria to
take into consideration. SVM classifiers not only predict a category, but also
assign a
weight to each category based on the given resource. These weights, given in the 
form of margin values, can be used by other classifiers which rely on different 
data sources. Adding up weights provided by different classifiers can help predict 
the correct category when a single classifier fails to categorize the resource ap- 
propriately. Weights provided by classifiers relying on social tags are especially 
useful when combining them with results from other classifiers. Nonetheless, 
not all data sources are helpful for combination in classifier committees, and the 
selected data source must be solid enough and provide reliable predictions to 
outperform the sole use of tags. When data sources are selected appropriately, 
the performance improvement can be considerable. We have shown that this 
varies among datasets. For example, with the Delicious dataset, it is important 
to analyze all three data sources (content, reviews, and tags). However, with the 
LibraryThing and GoodReads datasets reviews and tags suffice. 
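
The idea of adding up the weights of several classifiers can be sketched as follows. This is a simplified illustration under stated assumptions: the margin values are invented, the function name is hypothetical, and each classifier is assumed to output one margin per category for the resource being classified.

```python
def committee_predict(margins_per_classifier):
    """Sum per-category margins over classifiers and pick the best category.

    margins_per_classifier: list of dicts mapping category -> margin value,
    one dict per classifier (e.g., one per data source).
    Returns (predicted_category, summed_margins).
    """
    totals = {}
    for margins in margins_per_classifier:
        for category, margin in margins.items():
            totals[category] = totals.get(category, 0.0) + margin
    return max(totals, key=totals.get), totals


# Invented margins for one resource from two classifiers.
content_margins = {"Arts": 0.4, "Science": 0.3, "Sports": -0.9}
tag_margins = {"Arts": -0.2, "Science": 0.8, "Sports": -1.0}

content_only = max(content_margins, key=content_margins.get)
committee_choice, summed = committee_predict([content_margins, tag_margins])
```

In this toy case the content-based classifier alone would narrowly choose Arts, while the confident tag-based margin shifts the committee towards Science, which is the kind of correction described above.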

Research Question 6 

Are social tags also useful and specific enough to classify resources into narrower 
categories as in deeper levels of hierarchical taxonomies? 

We have analyzed the usefulness of social tags for classification on two
different levels of hierarchical taxonomies. Besides broader categories in the
top level, we have also explored the classification into narrower categories in
the second level. In this regard, social tags have been shown to outperform the
other data sources on social tagging sites that encourage users to annotate
resources (Delicious and LibraryThing). Tags clearly outperform the other data
sources in these cases, especially on Delicious, where the difference is even
more favorable in the second level. This difference is very similar on
LibraryThing. Finally, tags from GoodReads do not outperform other data sources
at any level, because the system does not encourage users to tag books and many
bookmarks are therefore not annotated.

Our findings lead to a different conclusion from that of Noll and Meinel
(2008a), who hypothesized that social tags were probably useless for deeper
levels of taxonomies, and that alternative data should be used instead. However,
the authors performed only a statistical analysis, and did not confirm the
hypothesis with real experiments.

Research Question 7 

Can we further consider the distribution of tags across the collection so that we can 
measure the overall representativity of each tag to represent resources? 

We have analyzed the suitability of IDF-like weighting functions to define
the representativity of tags, which consider the distribution of tags throughout
the whole collection of resources. Our experiments have shown that these
functions help improve the performance of a resource classification task.
However, we have shown that the settings of the social tagging system have an
effect on those distributions. Resource-based tag suggestions have been shown to
influence the structure of folksonomies greatly. Suggesting tags based on
previous annotations of others on the resource causes a very different tag
distribution, which, in turn, affects the results of the weighting function.
When a system enables resource-based tag suggestions, tag weighting functions
perform worse on their own, and combining them with other data sources is
required to improve performance; this combination can even outperform the
TF-based approach.

For our classification experiments, we have found that IDF-like weighting
functions clearly outperform the TF approach when resource-based tag suggestions
are not enabled, i.e., on LibraryThing and GoodReads, both when used on their
own and when combined with other data sources. In these cases, we found it
better to rely on the tag-based approach alone, without combining it with other
data sources, since it provides superior results that cannot be improved by
combining it with other predictions.

Research Question 8 

What is the best approach to weigh the representativity of tags in the collection for 
resource classification? 

Among the studied weighting functions, the one relying on bookmark frequencies
has been shown to be the best when there are no resource-based tag suggestions.
In these cases, IBF performs the best, followed by IRF and IUF. All of them
clearly outperform TF, both when used on their own and when combined with other
data sources using classifier committees.

On the other hand, when the social tagging system suggests tags to the user 
relying on the resource itself, IUF performs better than the others. IUF performs 
better than IBF and IRF, because of the importance of the ability of users to choose 
their own tags without relying on suggestions from these systems. Even though 
IUF does not outperform TF when used on its own, combining it with other data 
sources produces the best approach. However, it is only slightly better than the 
committees relying on TF, and any of them can be used to score similar results. 
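
For illustration, the three IDF-like functions can be sketched by analogy with the classical IDF, i.e., the logarithm of the total number of bookmarks, resources or users divided by the number of those in which the tag occurs. The exact formulas used in Chapter 6 are not restated in this section, so the log-ratio form, the function name and the toy folksonomy below are assumptions.

```python
import math


def inverse_frequencies(bookmarks):
    """Compute assumed IBF, IRF and IUF weights for every tag.

    bookmarks: list of (user, resource, tags) triples, one per bookmark.
    Returns {tag: {"IBF": float, "IRF": float, "IUF": float}}.
    """
    users, resources = set(), set()
    tag_bookmarks, tag_resources, tag_users = {}, {}, {}
    for user, resource, tags in bookmarks:
        users.add(user)
        resources.add(resource)
        for tag in set(tags):
            tag_bookmarks[tag] = tag_bookmarks.get(tag, 0) + 1
            tag_resources.setdefault(tag, set()).add(resource)
            tag_users.setdefault(tag, set()).add(user)
    return {
        tag: {
            "IBF": math.log(len(bookmarks) / tag_bookmarks[tag]),
            "IRF": math.log(len(resources) / len(tag_resources[tag])),
            "IUF": math.log(len(users) / len(tag_users[tag])),
        }
        for tag in tag_bookmarks
    }


# Hypothetical folksonomy: 4 bookmarks, 3 resources, 2 users.
toy_folksonomy = [
    ("u1", "r1", ["python", "code"]),
    ("u1", "r2", ["python"]),
    ("u2", "r1", ["python", "tutorial"]),
    ("u2", "r3", ["cooking"]),
]

weights = inverse_frequencies(toy_folksonomy)
```

A tag used by every user, such as python here, gets an IUF of zero, whereas a rare tag such as cooking gets the highest weights. In a TF-IDF-like scheme, these values would multiply the weight of the tag within each resource.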

Research Question 9 

Can we discriminate different user profiles so that we can find a subset of users 
who provide annotations that better fit a classification scheme? 

We have shown that such a type of user, the so-called Categorizer, actually
exists. Tags assigned by Categorizers provide a more accurate classification of
resources than those assigned by another set of users, the so-called Describers.
According to our experiments, this is mostly true for systems without tag
suggestions, i.e., LibraryThing, where the resource classification performed
with tags by Categorizers yields clearly better results. When such suggestions
exist, the detection of suitable users becomes more difficult, as we have shown
for GoodReads and Delicious. However, applying an appropriate measure that
considers suitable features can produce a successful selection of users who fit
the characteristics of a Categorizer.

Research Question 10 

What are the features that identify a user as a good contributor to the resource 
classification? 

We have analyzed two features that characterize users of social tagging sys- 
tems: verbosity, and diversity. We have shown that the level of verbosity helps 
discover Categorizers, who are better suited for the classification task. The vocab- 
ulary diversity is useful to find Describers, who tend to annotate using descriptive 
tags. Moreover, we have shown that users who do not rely on descriptive data 
provide better classification metadata than those who use descriptive tags. 

8.3 Future Directions 

The use of social tags for resource classification is still a novel research
field with little work done so far. This thesis has shown how social tags can be
useful for the resource classification task, and provides analysis to help
determine an optimal method to accurately categorize resources based on their
social tags. Furthermore, this thesis paves the way for future research on the
utilization of social tags for resource classification.

Throughout this thesis, we have considered each tag as a different token,
regardless of its semantic meaning. In this regard, future work includes
analyzing the meaning of each tag in order to discover synonymous words and
relations among them. Either natural language processing methods or
ontology-based approaches could improve the understanding of the meaning of each
tag and help further explore the knowledge provided by folksonomies.

The three weighting schemes we have used in Chapter 6 rely on the classical
TF-IDF function designed for text collections. Trying other weighting functions,
as well as defining a new one that fits the structure of folksonomies, would
also be interesting future work. This would especially help for systems
providing resource-based tag suggestions, like Delicious, where the tested
weighting schemes did not perform well.

Additional Results



In Chapter 5 we explored different representations of social tags in order to
evaluate which of them performs better on a resource classification task.
Among the approaches, we compared using all the tags annotated on each resource
with choosing just those in the top. For the latter, we focused on the top
10 tags, just to evaluate whether tags in the tail were harmful for this purpose.
However, we did not show whether a selection of the top 5 or top 15 tags could
be a better choice. In Table A.1 we show the results of using different numbers
of top tags for the FTA-based representation on the top level of the taxonomies.
The results confirm that relying on all the tags performs best, and that
choosing 5, 10 or 15 top tags does not change this conclusion. They also confirm
that tags in the tail are far less useful than those in the top, because the
improvement is much smaller when low-ranked tags are included.

Key Terms and Definitions



Next, we list and provide the definitions for some of the most relevant terms 
related to this thesis, which help to better understand social tagging systems: 

Tagging Tagging is an open way to assign tags or keywords to resources or items 
(e.g., web pages, movies or books), in order to describe them. This enables 
the later retrieval of the resources in an easier way, using tags as resource 
metadata. As opposed to a classical taxonomy-based categorization system, 
tags are usually non-hierarchical, and the vocabulary is open, so it tends
to grow indefinitely. For instance, a user could tag this thesis as social- 
tagging, research and thesis, whereas another user could use web2.0, social- 
bookmarking and tagging tags to annotate it. 

Social tagging A tagging system becomes social when its tag annotations are
publicly visible and useful to anyone. A tagging system being social implies
that a user can take advantage of tags defined by others to retrieve a
resource.

Social bookmarking Delicious, StumbleUpon and Diigo, amongst others, are
known as social bookmarking sites. They provide a social means to save
web pages (or other online resources like images or videos) as bookmarks,
in order to retrieve them later on. In contrast to saving bookmarks in the
user's local browser, posting them to social bookmarking sites allows the
community to discover others' links and, besides, allows users to access
their own bookmarks from any computer. In these systems, bookmarks represent
references to web resources, and do not attach a copy of them, but just a
link. Note that social bookmarking sites do not always rely on social tags
to organize resources, e.g., Reddit is a social bookmarking approach to add
comments on web pages instead of tags. However, the use of social tags in
social bookmarking systems is a common approach.

Social cataloging Social cataloging sites are quite similar to social
bookmarking sites in that resources are socially shared but, in this case,
offline resources like music, books or movies are saved. For instance,
LibraryThing allows users to save the books they like, Hulu does it for movies
and TV series, and Last.fm for music-related resources. As in social bookmarking
sites, tags are the most common way to annotate resources in social cataloging
sites.

Folksonomy As a result of a community tagging resources, the collection of tags
defined by them creates a tag-based organization, the so-called folksonomy. A
folksonomy is also known as a community-based taxonomy, where the
classification scheme is plain, there are no predefined tags, and therefore
users can freely choose new words as tags. A folksonomy is basically a
weighted set of tags, and may refer to a whole collection/site, a resource
or a user. A summary of a folksonomy is usually presented in the form of
a tag cloud.

Personomy Personomy is a neologism created from the term folksonomy, and it 
refers to the weighted set of tags of a single user/person. It summarizes
the topics a user tags about. 

Simple tagging Users describe their own resources or items, such as photos on
Flickr, news on Digg or videos on YouTube, but nobody else tags another
user's resources. Usually, it is the author of the resource who tags it. This
means no more than one user tags an item. In many cases, like in Flickr and
YouTube, simple tagging systems include an attachment to the resource,
and not just a reference to it.

Collaborative tagging Many users tag the same item, and every person can tag it
with their own tags in their own vocabulary. The collection of tags assigned
by a single user creates a smaller folksonomy, also known as a personomy. As
a result, several users tend to post the same item. For instance, CiteULike,
LibraryThing and Delicious are based on collaborative tagging, where each
resource (papers, books and URLs, respectively) can be annotated by all
the users who consider it interesting.

List of Acronyms



This is a list of acronyms used in this thesis: 

API Application Programming Interface 

DDC Dewey Decimal Classification 

FTA Full Tagging Activity 

HTML HyperText Markup Language

IBF Inverse Bookmark Frequency 

IDF Inverse Document Frequency 

IRF Inverse Resource Frequency 

IUF Inverse User Frequency 

LCC Library of Congress Classification 

ODP Open Directory Project 

ORPHAN Orphan Ratio 

TPP Tags Per Post 

TRR Tag Resource Ratio 

URL Uniform Resource Locator 

SVM Support Vector Machines 

S3VM Semi-Supervised Support Vector Machines

TF Term Frequency 

VSM Vector Space Model 

Resumen (Spanish Summary)



"The experimenter who does not know what he is looking for will not understand what he finds."
— Claude Bernard 

Using Folksonomies for
Resource Classification

In this thesis we address the problem of automatic resource
classification, an increasingly important task in our daily lives.
Cataloging books or organizing videos, among others, are examples
of activities for which an automatic classification process is
becoming more and more necessary. In this thesis we harness the
information contained in the annotations made by users of social
tagging systems, which gather metadata detailing the content of
different types of resources, in order to improve classification.
So far, few works have exploited these metadata for this purpose,
and the few that have done so were limited to statistical analyses.
In this thesis we explore the characteristics of these social
tagging systems and of the users involved in them, as well as of the
annotations they provide, always with the aim of making the most of
these large collections, thus obtaining the best possible
performance for an automatic resource classifier.

D.1 Motivation

Organizing resources into categories is a very common task in our daily lives.
Having resources assigned to predefined categories always helps improve later
access to the information they contain, since this access can then be limited
to a reduced set of desired categories. For example, librarians usually catalog
books by subject, so that they are organized by similar interests. Movie
databases, music catalogs and file systems, among others, are also usually
organized by categories, which facilitates future access to them. Likewise, the
classification of web pages is a task of special interest when it comes to
improving the results provided by search engines, since it helps narrow the
scope of the search to the category desired by the user. Web directories such as
Yahoo! Directory and Open Directory Project organize web pages into categories,
offering an alternative or complement to keyword search.

The problem, then, lies in how costly and expensive the manual classification
of these resources is when the collection is large. For example, the Library
of Congress of the United States reported in 2002 that the average cost of
cataloging each bibliographic record by professionals was 94.58 dollars 1.
Cataloging 291,749 records, as they did that year, cost them about 27 and a half
million dollars. Given how expensive manual categorization is, using automatic
classifiers can be a good alternative to reduce its cost, and also to keep
catalogs up to date with less human effort.

So far, most automatic classifiers have focused on the content of the
resources when representing them, especially in web page classification tasks
(Qi and Davison (2009)). However, the lack of representative data in the
content of many of them complicates this task. Moreover, it can be very hard
to obtain enough data for types of resources such as books or movies, whose
content may be harder to represent or may not even be available in a form that
can be processed.

As a solution to this problem, social tagging systems provide a simple and
cheap way to obtain metadata about resources. Systems such as Delicious 2,
LibraryThing 3 and GoodReads 4 collect user annotations in the form of tags
for large collections of resources. These user-provided tags yield meaningful
data describing the content of the resources (Heymann et al., 2008).

1 http://www.loc.gov/loc/lcib/0302/collections.html

2 http://delicious.com

3 http://www.librarything.com

4 http://www.goodreads.com

Through these tags, users provide a kind of organization of their own for the
resources. These tags are shared socially with the community, and because a
large number of users contribute to these systems, many annotations accumulate
on each resource. That accumulation therefore makes each individual annotation
more useful. Thus, the accumulation of users in an active community generates
a large number of bookmarks, tags and, consequently, annotated resources.

"Each individual categorization is worth less than a professional's
categorization. But there are many, many of them.", Joshua Schachter,
founder of Delicious, at the FOWA 2006 summit in London (England) 5.

Social tagging systems are a means to save, organize and search for resources,
all through annotation with tags chosen by the user. As the main hypothesis of
this work, we believe that these large collections of annotations can
considerably improve a resource classification task. In other words,
user-provided annotations could be very useful as a data source contributing
meaningful information that could help infer the category of resources.

Since a large number of users provide their own annotations on each resource,
our goal is to discover how to amalgamate those contributions in search of an
organization that resembles the categorization made by professionals. In this
context, where users contribute large amounts of metadata, our challenge is to
make the most of them in order to improve the performance of the resource
classification task.

"We are in an era in which data is cheap, but making the most of it is not",
Danah Boyd, Social Media Researcher at Microsoft Research New England, at
the WWW2010 conference in Raleigh, North Carolina, United States 6.

D.1.1 Resource Classification

Resource classification can be defined as the task of organizing resources
into a set of predefined categories. In this work we use Support Vector
Machines (SVM, Joachims (1998)), a state-of-the-art classification method that
has stood out for its good results since the late 1990s. This classification
algorithm is based on the analysis of a set of previously categorized
instances, which are fed to the classifier so that it acquires the knowledge
needed to later classify new resources.

5 http://simonwillison.net/2006/Feb/8/summit/
6 http://www.danah.org/papers/talks/2010/WWW2010.html

A resource classification problem can be characterized along different
dimensions. On the one hand, regarding the learning method, it can be
supervised, where the whole training set is previously categorized, or
semi-supervised, where instances without category information are also
exploited during the learning phase. On the other hand, considering the number
of classes, classification can be binary, when only two categories can be
assigned to each resource, or multiclass, when there are three or more
categories. The former is commonly used for filtering systems, whereas the
latter is frequent for larger taxonomies, as in topical classification of
resources.

For topical classification over large collections of resources, such as web
pages on the Web or books in libraries, taxonomies are usually defined by more
than two categories, and the subset of previously categorized resources is
usually very small. We therefore believe that the application of
semi-supervised and multiclass techniques should be considered and analyzed
for this kind of task.

For this reason, in this thesis we first analyze several SVM-based
classification techniques in order to assess their suitability for these
tasks. These techniques include different approaches to solving multiclass
tasks, as well as supervised and semi-supervised algorithms.

D.1.2 Social Annotations

Social tagging systems allow their users to save and annotate their favorite
resources (such as web pages, movies, books, photos or music), sharing them in
turn with the community. Users usually provide these annotations in the form
of tags. Tagging is the open way of assigning tags or keywords to resources so
that they can be described and organized. This makes later retrieval of the
resources easier, using the tags as metadata that describe them. Usually there
are no predefined tags, so users can freely choose whatever words they want as
tags.

"Tagging is mainly a user interface: a way for people to remember things,
what they were thinking about at the moment they saved it. Quite useful for
recall, good for discovery, terrible for distribution (where publishers add
as many tags as they can to get it into as many boxes as possible).",
Joshua Schachter, founder of Delicious, at the FOWA 2006 summit in London
(England) 7.

This process generates a structure of tags known as a folksonomy, that is, a
user-driven organization of resources. Folksonomy is a contraction of the
words folk (people), taxis (classification) and nomos (management). It is also
known as a user-based taxonomy, in which the structure is not hierarchical,
unlike a basic taxonomic classification. A folksonomy is therefore related to
expert-generated taxonomies in that resources are likewise organized into
groups.

These annotations are said to belong to a social environment when they are
accessible and usable by any user. This feature makes it possible to search
for resources using the annotations contributed by others. It is also one of
the reasons that encourages users to contribute.

However, not all annotations are shared in the same way. The social tagging
system itself can define some restrictions in this regard, mainly by
establishing who is allowed to annotate each resource. In this sense, two
types of systems can be distinguished (Smith, 2008):

• Simple tagging systems: users can describe their own resources, as in the
tagging of photos on Flickr 8, news on Digg 9 or videos on Youtube 10, but
nobody annotates other people's resources. Generally, the author of the
resource is the one who annotates it. This means that no more than one user
can tag each resource. More formally, in a simple tagging system there is a
set of users ($U$) who annotate resources ($R$) with tags ($T$). Each user
$u_i \in U$ can save a resource $r_j \in R$ with a set of tags
$T_j = \{t_{j1}, \ldots, t_{jp}\}$, with a variable number $p$ of tags. The
set of tags assigned to $r_j$ will remain limited to $T_j$, since nobody else
can annotate it.

• Collaborative tagging systems: many users can annotate each resource, and
all of them can tag it with their own vocabulary. The set of tags assigned by
one user forms a smaller-scale folksonomy, known as a personomy. As a result,
several users tend to annotate the same resource. For example, CiteULike.org,
LibraryThing.com and Delicious are based on collaborative annotations, where
each resource (articles, books and URLs, respectively) can be annotated and
tagged by all the users who find it interesting. Collaborative tagging systems
are therefore somewhat more complex, since there is a set of users ($U$) who
save their bookmarks ($B$) on resources ($R$), annotating them with tags
($T$). Each user $u_i \in U$ can save a bookmark $b_{ij} \in B$ of a resource
$r_j \in R$ with a set of tags $T_{ij} = \{t_{ij1}, \ldots, t_{ijp}\}$, with a
variable number $p$ of tags. After $k$ users have saved $r_j$, it is described
as a weighted set of tags $T_j = \{w_{j1} t_{j1}, \ldots, w_{jn} t_{jn}\}$,
where $w_{j1}, \ldots, w_{jn} \le k$ represent the number of assignments of
each tag. Each bookmark is therefore composed of the triple of a user, a
resource and a set of tags: $b_{ij} : u_i \times r_j \times T_{ij}$. Thus,
each user saves bookmarks of different resources, and each resource has
bookmarks belonging to different users. The result of accumulating the tags
contained in a user's bookmarks is known as that user's personomy:
$T_i = \{w_{i1} t_{i1}, \ldots, w_{im} t_{im}\}$, where $m$ is the number of
different tags in the user's personomy.

7 http://simonwillison.net/2006/Feb/8/summit/
8 http://www.flickr.com
9 http://digg.com
10 http://www.youtube.com

Figure D.1 shows a comparative example of both types of systems.

Figure D.1: Comparison of user-provided annotations in simple and
collaborative tagging systems.

In this thesis we focus on collaborative tagging systems. Generally, the tags
associated with a resource tend to coincide across users, which makes this
agreement especially useful compared with the tags found in simple tagging
systems.

In a collaborative tagging system, as an example, one user might tag this work
as social-tagging, research and thesis, while another user might use the tags
social-tagging, social-bookmarking, phd and thesis to annotate it. User
behavior can differ considerably in these systems, where the accumulation of
their annotations is usually regarded as a consensus. For example, the result
of accumulating the annotations above by summation would be the following:
thesis (2), social-tagging (2), social-bookmarking (1), phd (1) and
research (1).
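
The accumulation by summation in the two-user example above can be sketched in a few lines of Python. This is only an illustration: the user names, the `bookmarks` structure and the `aggregate_tags` helper are assumptions made for the example (with the tags written in English), not part of any real system.

```python
from collections import Counter

# Hypothetical bookmarks: (user, resource, set of tags), one per user and resource.
bookmarks = [
    ("user_a", "this-thesis", {"social-tagging", "research", "thesis"}),
    ("user_b", "this-thesis", {"social-tagging", "social-bookmarking", "phd", "thesis"}),
]

def aggregate_tags(bookmarks, resource):
    """Sum tag assignments over all bookmarks of one resource."""
    weights = Counter()
    for _user, res, tags in bookmarks:
        if res == resource:
            weights.update(tags)  # each user adds 1 to every tag they used
    return weights

weights = aggregate_tags(bookmarks, "this-thesis")
# Print tags by decreasing weight, ties broken alphabetically.
for tag, count in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{tag} ({count})")
```

Because each user contributes at most one assignment per tag, no weight can exceed the number of users who saved the resource.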

In this thesis, we analyze and study the annotations provided by users in
social tagging systems. We present a study aimed at making the most of them,
with a view to improving the performance of a resource classification task.
Specifically, we focus on analyzing the usefulness of user-generated
folksonomies as an approximation to an organization similar to expert-created
taxonomies. In this context, we study different representations based on
social annotations, in search of an approach that resembles the
expert-provided classification as closely as possible. We focus on getting the
most out of social tags, both by looking for the best representation and by
measuring the impact that the distribution of tags over resources, bookmarks
and users can have in this respect. Finally, we also study the application of
state-of-the-art techniques for analyzing user behavior in these systems, in
order to detect users whose annotations are closer to the expert-created
classification.

D.2 Objectives

The main objective of this thesis is to contribute new knowledge about the
appropriate use of the large amount of data found in social tagging systems.
Given the interest in classifying resources and the lack of representative
data, we focus on analyzing to what extent and in what way social tags can
improve the resource classification task. At the beginning of this work we
found no research addressing this problem, which motivated us to carry out
this investigation. To this end, we have defined the following problem
statement, which summarizes the main objective of this thesis:

Problem Statement

How can the annotations provided by users in social tagging systems be
exploited so as to obtain a more accurate resource classification?

D.3 Methodology

The research methodology followed throughout the work consists of the
following 8 steps:

1. Review and reading of the state of the art, as well as studying and
understanding in detail how social tagging systems work.

2. Search for an SVM classifier suitable for carrying out the research.

3. Search for existing collections with information extracted from social
tagging systems. Since we found none that met our requirements, we had to
create three large-scale collections instead.

4. Devising and proposing approaches suited to the classification task based
on social tags.

5. Evaluation of the proposed approaches.

6. Rigorous analysis of the results, in order to reach solid conclusions.

7. Presentation of partial results at national and international conferences
and workshops, in order to obtain comments and suggestions from other
researchers.

8. Summarizing in this thesis the research, contributions and conclusions
reached throughout the work.

Steps 4 to 6 formed an iterative process and were repeated several times.

This thesis consists of 8 chapters. Below we briefly summarize the content of
each of them:

Chapter 1
Introduction

We present the motivation for studying the use of social annotations for
resource classification. We formalize the problem and motivate the need for
such a study.

Chapter 2
Related Work

We offer a summary of previous work in the research field. We summarize the
advances in related fields, regarding both the use of social annotations and
resource classification.

Chapter 3
Support Vector Machines for Large-Scale Classification

We study different SVM approaches to the problem of classifying large
collections of resources over multiclass taxonomies. We identify the best SVM
approach for these cases and use it throughout the work to perform the
classification tasks.

Chapter 4
Creation of Social Tagging Collections

We describe and analyze in detail the social tagging collections used in this
thesis. We detail the process of generating those collections and analyze the
main characteristics of their corresponding folksonomies.

Chapter 5
Representing the Accumulation of Tags

We propose and evaluate different resource representations that use social
tag information for the resource classification task. We study the usefulness
of social tags compared with other data sources, and propose a representation
that makes the most of them. We also address the problem by combining social
tags with the other available data sources to obtain better performance.

Chapter 6
Analyzing the Distribution of Tags for Resource Classification

We address the idea of considering the representativeness of tags within a
collection of annotations from a social tagging system. We study the
application of weighting functions adapted to these systems, and analyze their
suitability taking into account the configuration of each system.

Chapter 7
Analyzing User Behavior for Classification

We explore the effect that user behavior in social tagging systems can have
on a resource classification task. Building on previous work suggesting the
existence of certain users who tend to categorize resources, we study whether
they really fit resource classification better.

Chapter 8
Conclusions and Future Work

We summarize and discuss the main conclusions and contributions of the work.
We present the answers to the research questions formulated at the beginning,
and outline future work.

In addition, the thesis contains the following appendices at the end, with
additional information and summaries in other languages:

Appendix A
Additional Results

We present some additional results, which we decided not to include in the
body of the thesis for the sake of clarity, but which are worth showing since
they help demonstrate and understand some conclusions.

Appendix B
Keywords and Definitions

We list the most relevant terms related to social tagging systems and provide
detailed definitions.

Appendix C
List of Acronyms

We list the acronyms used throughout this work and indicate what they stand
for.

Appendix D
Resumen (Summary in Spanish)

Summary of the content of this work in Spanish.

Appendix E
Laburpena (Summary in Basque)

Summary of the content of this work in Basque.

D.5 Research Questions Answered

Research Question 1

What kind of SVM classifier should be used to carry out this type of
classification task: a native multiclass classifier, or a combination of
binary classifiers?

Native multiclass SVM classifiers have been shown to be clearly superior to
the other approaches, which combine binary classifiers. The results show that
relying on a set of binary classifiers is not a good option when dealing with
multiclass taxonomies. Native multiclass classifiers, which consider all
classes at the same time and have more knowledge of the whole task, therefore
work better in these cases.

Research Question 2

What learning method performs better for this type of classification task: a
supervised or a semi-supervised one?

Semi-supervised methods might perform better when the labeled subset is very
small, but supervised methods, which are computationally less expensive,
achieve very similar performance with just a few more labeled instances. We
have therefore also shown that, unlike what Joachims (1999) showed for binary
classification tasks, a supervised method obtains results very similar to
those of a semi-supervised one for these large, multiclass collections. It
seems reasonable to think that predicting the class of unlabeled instances
becomes much harder as the number of classes grows and, therefore, that the
increase in prediction errors is also reflected in the learning phase of the
classifier.

Based on these conclusions, we decided to use a supervised multiclass SVM
classifier throughout this thesis.

Research Question 3

How does the configuration of social tagging systems affect users' annotations
and the resulting folksonomies?

To answer this, we analyzed several features found in the configuration of
social tagging systems. Among the features analyzed, we showed the large
impact of tag suggestions, which considerably alter the resulting folksonomy.
Each of the social tagging systems we studied behaves differently in this
respect:

• Resource-based suggestions (Delicious): when the system suggests tags
assigned by other users to the resource being saved, the probability of using
new tags that contribute new information decreases. In this case, users devote
little effort to thinking for themselves and prefer to rely on the suggestions
provided by the system.

• Personomy-based suggestions (GoodReads): when the system suggests tags that
the same user has used previously, the vocabulary of their personomy tends to
be much smaller. However, users do not know what others have annotated on each
resource, so they are very likely to contribute new tags that had not
previously been annotated on the resource.

• No suggestions (LibraryThing): when the system does not suggest tags to the
user, the vocabulary of their personomy tends to be larger, and the tags
assigned to each resource are more diverse.

Research Question 4

What is the best way to accumulate users' annotations on a resource in order
to obtain a representation?

We have shown that it is better to take into account all the tags annotated on
a resource than to rely only on those annotated by the most users. The
most-annotated tags have proven to be the most important ones, and they
provide the most relevant information about the topic of the resource.
However, less popular tags can also be useful to a lesser extent, contributing
other kinds of useful information that improves classifier performance.

As for the weights assigned to tags when representing the resource, the best
results are obtained by considering the number of users who annotate each tag.
Using this value produced the best results in our experiments, outperforming
other approaches that ignore these weights, and showing that there is no need
to consider the total number of users who annotate the resource.

Therefore, from our experiments, we conclude that the best representation is
the one that uses all the tags, assigning as weight the number of users who
have annotated them.
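
The alternatives compared in this answer (all tags versus only the most popular ones, and user counts versus ignored weights) can be illustrated with a minimal sketch. The `tag_vector` helper, its parameters and the example counts are hypothetical and only meant to make the options concrete; they are not the representations as implemented in the thesis.

```python
def tag_vector(tag_users, top_n=None, use_counts=True):
    """Build a {tag: weight} representation from {tag: number of users}.

    top_n=None keeps every tag; use_counts=False ignores the user counts
    and gives every kept tag a weight of 1.
    """
    ranked = sorted(tag_users.items(), key=lambda kv: (-kv[1], kv[0]))
    if top_n is not None:
        ranked = ranked[:top_n]
    return {tag: (count if use_counts else 1) for tag, count in ranked}

# Hypothetical number of users who assigned each tag to one resource.
tag_users = {"thesis": 2, "social-tagging": 2, "phd": 1, "research": 1}

all_weighted = tag_vector(tag_users)                   # all tags, user counts as weights
top2_weighted = tag_vector(tag_users, top_n=2)         # only the most popular tags
all_binary = tag_vector(tag_users, use_counts=False)   # all tags, weights ignored
```

In this sketch, `all_weighted` corresponds to the option the answer above concludes is best: every tag is kept and weighted by the number of users who assigned it.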

Research Question 5

Despite the usefulness of social tags for these tasks, is it worth considering
other data sources, such as the content of the resources, to further improve
the results?

Using classifier combination techniques, which consider the predictions of
different classifiers, we have shown that tags provide reliable evidence to
take into account. This evidence is very useful for combining those tags with
other data sources. However, not all data sources are useful to combine, and
care must be taken to select those that obtain solid results and also offer
reliable predictions. When data sources are chosen appropriately, the
performance improvement is considerable.
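
This summary does not specify the combination rule, so the following is only an illustrative assumption: one simple way to combine classifiers is to add the per-class scores produced by classifiers trained on different data sources and pick the class with the highest total. The `combine_scores` helper, the category names and all scores below are made up for the example.

```python
def combine_scores(*score_dicts):
    """Add per-class scores from several classifiers; return (best class, totals)."""
    totals = {}
    for scores in score_dicts:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical per-class scores (for instance, SVM margins) for one resource.
tag_scores = {"Arts": 0.2, "Science": 1.1, "Sports": -0.9}
content_scores = {"Arts": 0.6, "Science": -0.1, "Sports": -0.4}

best, totals = combine_scores(tag_scores, content_scores)
print(best, totals)
```

Under a rule like this, a data source that gives confident but wrong scores can outvote a better one, which is consistent with the caution above about selecting only sources with reliable predictions.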

Research Question 6

Are social tags also useful and specific enough to classify resources into
lower-level categories?

We analyzed the usefulness of social tags for classification at two different
levels of the taxonomies. Besides the top-level categories, we also explored
classification into second-level, more specific categories. Here, the results
using social tags were superior to those obtained with other data sources for
those social tagging systems that encourage users to contribute annotations
(Delicious and LibraryThing). The superiority is very clear in these cases,
especially for Delicious, where the difference is even larger at the second
taxonomy level. The difference remains very similar for LibraryThing. Finally,
GoodReads tags do not outperform the other data sources, not even at the first
level, since the system does not encourage users to annotate books, so many
bookmarks are left without tags.

These findings lead to a different conclusion from that of Noll and Meinel
(2008a), where the authors hypothesize that social tags might not be useful
for lower levels of the taxonomies, and that other types of data should be
used in those cases.

Research Question 7

Can we take into account the distribution of tags across the collection in
order to measure the overall representativeness of a tag?

Through the experiments carried out in this thesis, we have shown the
usefulness of considering tag distributions across the collection by means of
an inverse weighting function such as the one offered by IDF. These functions
served to determine the representativeness of tags in each collection, in
order to improve the performance of the resource classification task. However,
we have shown that the configuration of the social tagging system has a lot to
do with those distributions. Among the configuration features of the systems,
resource-based suggestions were found to greatly influence the structure of
the resulting folksonomies. Systems that suggest tags to the user based on
previous annotations of the resource produce tag distributions very different
from those that do not suggest tags and leave users to make their own choice.
This feature was also decisive for the successful application of weighting
functions to these distributions.

We found that the proposed tag weighting functions clearly outperform the
TF-based approach when the system does not suggest resource-based tags (that
is, in LibraryThing and GoodReads), both when used on their own and when
combined with other data sources. In fact, it is better simply to use the
tag-based approach than to combine it with other data sources, since on its
own it offers the best results, which are not improved by combination.

However, when the system suggests tags based on the resource, the folksonomies
generated are very different from the rest. This greatly affects the tag
distributions and, therefore, the weighting functions we studied. As a result,
using tag weighting functions yields worse results than not using them, and
they need to be combined with other data sources to work better. In this
latter case, they can outperform the TF-based approach, thanks to the good
predictions they provide, which help feed the classifier combination
appropriately.

Research Question 8

What is the best approach for establishing the representativeness of tags in
the collection?

Among the weighting functions we studied, the one based on bookmark
frequencies proved to be the best for systems without resource-based tag
suggestions. In these cases, IBF is the best option, followed by IRF and IUF.
All of them clearly outperform TF, both when used on their own and when
combined with other data sources.

On the other hand, when the system suggests tags based on the resource, it is
better to rely on user frequency. IUF works better than IBF and IRF in these
cases, owing to the importance of those users who tend to choose their own
tags instead of relying on the suggestions. Although not even IUF outperforms
TF, when combined with other data sources it becomes the best option. However,
the results in this latter case are only slightly better than those obtained
by the combination that uses TF, so either could be used to obtain similar
results.
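
This summary names the three inverse weighting functions (IRF, IBF and IUF) without giving their formulas. By analogy with IDF, the sketch below assumes each one is log(N / n_t), where N is the number of resources, bookmarks or users in the collection and n_t is how many of them contain tag t. That formula, the `inverse_frequencies` helper and the toy data are assumptions for illustration only.

```python
import math
from collections import defaultdict

# Hypothetical bookmarks: (user, resource, set of tags).
bookmarks = [
    ("u1", "r1", {"python", "programming"}),
    ("u2", "r1", {"python"}),
    ("u1", "r2", {"programming", "java"}),
    ("u3", "r3", {"cooking"}),
]

def inverse_frequencies(bookmarks):
    """Return {tag: {"irf": ..., "ibf": ..., "iuf": ...}} using log(N / n_t)."""
    resources, users = set(), set()
    tag_resources, tag_users = defaultdict(set), defaultdict(set)
    tag_bookmarks = defaultdict(int)
    for user, resource, tags in bookmarks:
        resources.add(resource)
        users.add(user)
        for tag in tags:
            tag_resources[tag].add(resource)  # resources annotated with the tag
            tag_users[tag].add(user)          # users who have used the tag
            tag_bookmarks[tag] += 1           # bookmarks containing the tag
    return {
        tag: {
            "irf": math.log(len(resources) / len(tag_resources[tag])),
            "ibf": math.log(len(bookmarks) / tag_bookmarks[tag]),
            "iuf": math.log(len(users) / len(tag_users[tag])),
        }
        for tag in tag_bookmarks
    }

weights = inverse_frequencies(bookmarks)
```

As with TF-IDF, a resource's tag weight would then be its tag count (TF) multiplied by the chosen inverse frequency, so that tags spread widely over the collection count for less than rare ones.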

Research Question 9

Can we discriminate between different user profiles so as to find a subset of
users who provide annotations better suited to the classification task?

We have shown that such a type of user, called a Categorizer, does in fact
exist. According to our experiments, this holds especially for systems without
tag suggestions, as in LibraryThing, where resource classification using the
tags of Categorizer users obtains better results. When suggestions exist,
detecting users suited to the task becomes harder, as we have shown for
GoodReads and Delicious. Nevertheless, using the appropriate measure can
produce a successful selection of users who fit the characteristics of a
Categorizer.

Research Question 10

What are the characteristics that identify a user as suitable for the resource
classification task?

Of the two characteristics we considered in this work, we found that
differentiating users by their level of verbosity makes it possible to find a
set of users who are better suited to the classification task. On the other
hand, we found that separating users by the diversity of their vocabulary does
not discriminate well for this purpose, but rather serves to find another type
of user, called Describers. In addition, we found that users who do not use
descriptive data in their annotations offer tags that fit resource
classification better.

D.6 Main Contributions

The novel idea of this research work lies in the use of social annotations to
enrich a resource classification task. To the best of our knowledge, the first
research work to carry out experiments with real classification tasks was our
first work in this field (Zubiaga et al., 2009d). Previously, only Noll and
Meinel (2008a) had performed a statistical analysis comparing social tags with a
classification made by experts. Given the lack of work in the area, the research
gathered in this thesis contributes new knowledge on the appropriate use and
representation of social tags for resource classification. Specifically, our
main contributions to the research area are the following:

• We have created 3 large-scale collections that include both social tags and
the corresponding category information for a set of resources. These can be
regarded as among the largest collections used in the research area and, as
far as we know, the largest used for resource classification. Some of these
collections, along with other smaller ones that we created over the course of
the work, have been made publicly available for research purposes 11. Among
others, Godoy and Amandi (2010) and Strohmaier et al. (2010b) have used some
of our collections in their research.

• Our work is the first to compare different representations of resources using
social tags. It is also the first to carry out classification tasks comparing
social tags with other types of data sources. We have shown that social tags
are also useful for more specific, lower-level categories. Contrary to Noll
and Meinel (2008a), whose statistical study concludes that social tags might
not be useful for more specific categories, we have shown that they are even
more useful there than for more general categories.

11 http://nlp.uned.es/social-tagging/datasets/

• We have analyzed the distributions of social tags in folksonomies, and have
carried out a rigorous study of how the settings of a social tagging system
affect those distributions. In this respect, we have adapted weighting
functions based on the well-established TF-IDF to the domain of social
tagging and folksonomies.

• We have shown the existence of a group of users, called Categorizers, whose
annotations resemble the classification made by experts more closely than
those of another group of users, called Describers. Although the approach for
telling Categorizers and Describers apart was already established in previous
work, here we have taken on the task of showing that Categorizers are better
suited to resource classification.

The use of social annotations for the benefit of resource classification tasks
was a new line of research when this thesis began. However, researchers' growing
interest in user-generated content on social media, and in social tagging
systems in particular, has recently led to numerous works in the area. Along
with this growth, more researchers have shown interest in using social
annotations for resource classification, and the number of related works has
increased considerably. Godoy and Amandi (2010), for example, present a study of
tag-based classification that is inspired by one of our works (Zubiaga et al.,
2009d).

D.7 Future Work

The use of social annotations for resource classification is a research field
that is still in its infancy, and relatively little work has been done so far.
The work presented in this thesis reaches conclusions on how to represent social
tags in pursuit of the most accurate resource classification possible. It also
opens up several lines of future work.

Throughout this thesis we have treated each tag as a distinct symbol, without
taking its semantic meaning into account. In this respect, our plans for future
work include analyzing the meaning of tags in order to discover synonyms and
relations among them. Whether through natural language processing techniques or
through semantic approaches, this could help to understand the meaning of each
tag, making it possible to explore further the knowledge that folksonomies
provide.

The three weighting functions we used in Chapter 6 are based on the well-known
TF-IDF, which was originally designed for text collections. We believe that
trying other weighting functions, as well as exploring the definition of a new
function that fits the needs of these social structures, could yield interesting
contributions as future work. This would help above all in systems that provide
resource-based tag suggestions, as is the case with Delicious, where the
weighting functions we experimented with did not give good results.

"A language is not lost because those who do not know it fail to learn it, but
because those who know it fail to use it."
— Joxean Artze 

Harnessing Folksonomies for Resource Classification

This thesis deals with resource classification, addressing a task that is as
important as it is common in everyday life, such as cataloging books or
organizing videos. To carry out the task, we make use of the annotations
provided by users on social tagging systems. On these sites, users tend to
provide a wealth of metadata on a variety of resources. So far, few research
works have used this metadata for this purpose, and those few have been limited
to statistical analyses. In this thesis, we study the characteristics of these
systems, of their users and of their annotations, aiming to make the most of
these large data sets and to achieve the most accurate automatic classification
of resources possible.

E.1 Motivation

Classifying resources of any kind into predefined categories is a common task in
our everyday lives. Assigning categories to resources makes it easier to
retrieve them later, by restricting the search to the desired category. For
instance, librarians usually organize books by subject in catalogs. Likewise,
movie databases, music catalogs and file systems, among others, are usually
organized into categories, making those resources easier to find. Similarly, web
directories such as Yahoo! Directory and the Open Directory Project organize web
pages into categories. Having web pages classified can improve the performance
of Internet search engines by restricting results to the category the user is
interested in (Qi and Davison, 2009).

Doing this categorization work by hand, however, is very expensive when the set
of resources is large. As an example, the United States Library of Congress
reported that in 2002 each book cataloged by professionals cost 94.58 dollars 1.
For the 291,749 records cataloged that year, they therefore had to pay more than
27.5 million dollars. Given how expensive this task is, moving to automatic
classifiers seems a suitable alternative for reducing manual labor while keeping
catalogs up to date.

So far, most automatic classifiers have relied on the content of resources as
their source of information, especially for web page classification (Qi and
Davison, 2009). The content of resources does not always carry meaningful
information, however, and this makes the task harder. Moreover, it is sometimes
not easy to obtain enough data for resources such as books and movies. In such
cases the content is harder to represent, and it may not be available in a
format that can be processed easily.

As a possible solution to these problems, social tagging systems offer an easier
and cheaper way to obtain metadata about resources. Sites such as Delicious 2,
LibraryThing 3 and GoodReads 4 gather the tags that users define on resources.
These user-generated tags have been shown to be meaningful data that describe
the content of the resources (Heymann et al., 2008).

Through these tags, users provide something akin to their own classification of
resources, and these tags are shared socially with the community. Since large
numbers of users take part in these systems, their annotations are aggregated on
the resources. This ability to aggregate the annotations of different users on
each resource makes every one of those annotations all the more useful and
valuable. Users in active communities can produce large sets of bookmarks, tags
and annotated resources.

"Each individual classification is worth less than one made by a professional.
But there are many, a great many of them", Joshua Schachter, creator of
Delicious, at the 2006 FOWA summit in London (England) 5.

1 http://www.loc.gov/loc/lcib/0302/collections.html
2 http://delicious.com
3 http://www.librarything.com
4 http://www.goodreads.com

Social tagging systems are tools for saving, organizing and searching resources,
which allow users to annotate them with tags of their own choosing. We believe
that these annotations can substantially improve the automatic classification of
resources. These user-generated annotations could serve as a source of
information about the category of a resource.

Since many users assign annotations to each resource, our main goal is to find a
suitable way of aggregating their contributions, so as to obtain an organization
similar to the categorization made by professionals. Since many users assign a
large amount of metadata, our challenge is to obtain the best possible result.

"These days it is easy to get data, but it is not so easy to make sense of it",
Danah Boyd, Social Media researcher at Microsoft Research New England, at the
WWW2010 conference in Raleigh, North Carolina (United States of America) 6.

E.1.1 Resource Classification

Resource classification is the task of organizing resources into a predefined
set of categories. In this thesis we use Support Vector Machines (SVM, Joachims
(1998)), a state-of-the-art classification method. Classification tasks of this
kind rely on a set of previously classified resources, which is used to build
the knowledge the classifier needs.

A resource classification task can have different characteristics. On the one
hand, regarding the system's learning method, it is said to be supervised when
all the resources used for learning have been classified beforehand, and
semi-supervised when predictions made on unclassified resources are also used
for learning. On the other hand, regarding the number of categories,
classification can be binary, when only two categories can be assigned, or
multiclass, when there are three or more categories. The former is commonly used
for filtering systems, whereas the latter is used with larger taxonomies, for
example in topical classification.

For the topical classification of large collections of resources, such as web
pages or library books, taxonomies usually have more than two categories, and
the number of previously classified resources tends to be very small. We
therefore consider it worthwhile to take into account and analyze both
semi-supervised and multiclass techniques, in order to find out which is the
best option for carrying out these tasks.

5 http://simonwillison.net/2006/Feb/8/summit/
6 http://www.danah.org/papers/talks/2010/WWW2010.html

In this thesis we propose an analysis of several methods based on the SVM
algorithm, examining their suitability for these tasks. Among these methods we
analyze different multiclass techniques, as well as both supervised and
semi-supervised techniques.

E.1.2 Social Annotations

Social tagging systems let users save and annotate their favorite resources (web
pages, movies, books, photos or music, among others) while sharing them with the
community. Users usually provide these annotations in the form of tags. Tagging
is the act of assigning keywords, or tags, to resources, which makes it possible
both to describe and to organize them. This in turn makes those resources easier
to find later, using the tags as search keys. In most systems there are no
predefined tags, so users can choose whatever words they wish as tags.

"Tagging has a lot to do with the interface: a way for people to recall things,
one that shows what they were thinking about at the moment they saved them.
Fairly useful for recall, good for discovery, terrible for distribution (where
publishers define as many tags as they can in order to be classified into more
boxes).", Joshua Schachter, creator of Delicious, at the 2006 FOWA summit in
London (England) 7.

This tagging process gives rise to a structure known as a folksonomy, that is,
an organization of resources created by users. Folksonomy is a blend of the
words folk (people), taxis (classification) and nomos (management). A
folksonomy is also known as a community-based taxonomy in which the
classification scheme is non-hierarchical, unlike the taxonomic classifications
made by experts. Folksonomies are thus related to some extent to expert-made
classifications, since in both cases resources are sorted into groups.

These annotations are said to be social when they are shared with the community
in a social environment and so become available to everyone else. The main
advantage this brings is the ability to search using the tags that others have
assigned. It is also one of the features that encourages many users to take
part.

7 http://simonwillison.net/2006/Feb/8/summit/

Not all annotations are shared in the same way, however. The social tagging site
itself can define certain constraints, mainly by limiting who can annotate each
resource. In this regard, we can distinguish two different types of systems
(Smith, 2008):

• Simple tagging systems: users can tag their own resources (for example,
photos on Flickr 8, videos on Youtube 9 or news on Digg 10), but nobody can
tag other users' resources. Usually it is the author of the resource who tags
it. As a result, each resource is tagged by a single user only. In a simple
tagging system there is a set of users ($U$) who assign a set of tags ($T$)
to resources ($R$). A user $u_i \in U$ annotates his or her resource
$r_j \in R$ with a set of tags $T_j = \{t_{j1}, \ldots, t_{jp}\}$, with a
variable number $p$ of tags. The set of tags assigned to resource $r_j$ will
remain $T_j$ from then on, since nobody else can annotate it.

• Collaborative tagging systems: many users annotate the same resource, each
using tags of their own. The set of tags used by each user forms a smaller
folksonomy, known as a personomy. In these systems several users tend to tag
the same resource. For instance, CiteULike.org, LibraryThing.com and Delicious
are based on collaborative tagging, where each resource (papers, books and
URLs, respectively) can be tagged by every user who finds it interesting.
Collaborative tagging systems are more complex than simple ones. In these
systems there is a set of users ($U$) who save bookmarks ($B$) on resources
($R$), annotating them with sets of tags ($T$). A user $u_i \in U$ can save a
bookmark $b_{ij} \in B$ of a resource $r_j \in R$ using a set of tags
$T_{ij} = \{t_{ij1}, \ldots, t_{ijp}\}$, with a variable number $p$ of tags.
After $k$ users have saved resource $r_j$, its annotations can be defined as a
weighted set of tags $T_j = \{w_{j1} t_{j1}, \ldots, w_{jn} t_{jn}\}$, where
$w_{j1}, \ldots, w_{jn} \leq k$ denote the number of assignments of each tag.
Accordingly, each bookmark comprises a user, a resource and a set of tags:
$b_{ij} : u_i \times r_j \times T_{ij}$. Each user bookmarks different
resources, and at the same time a resource can have bookmarks saved by
different users. The result of aggregating the tags in a user's bookmarks is
known as the personomy: $T_i = \{w_{i1} t_{i1}, \ldots, w_{im} t_{im}\}$,
where $m$ is the number of different tags the user has.

In this thesis we deal with collaborative tagging systems. Figure E.1 shows the
differences between these two types of systems by means of an example.
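The notation for collaborative tagging can be made concrete with a small sketch. The bookmark data, the user and resource identifiers, and the function names below are hypothetical and chosen only for illustration; the sketch assumes that each bookmark is a (user, resource, tag set) triple, and derives the weighted tag set of a resource and the personomy of a user by counting.

```python
from collections import Counter

# Hypothetical bookmarks b_ij = (user u_i, resource r_j, tag set T_ij).
bookmarks = [
    ("u1", "r1", {"python", "tutorial"}),
    ("u2", "r1", {"python", "programming"}),
    ("u1", "r2", {"python", "svm"}),
]


def resource_tags(bookmarks, resource):
    """Weighted tag set T_j: for each tag, how many users assigned it to r_j."""
    weights = Counter()
    for _user, res, tags in bookmarks:
        if res == resource:
            weights.update(tags)  # each tag in the set counts once per bookmark
    return weights


def personomy(bookmarks, user):
    """Weighted tag set T_i: how often user u_i used each tag in their bookmarks."""
    weights = Counter()
    for usr, _res, tags in bookmarks:
        if usr == user:
            weights.update(tags)
    return weights


t_r1 = resource_tags(bookmarks, "r1")  # tags aggregated over the k = 2 users of r1
t_u1 = personomy(bookmarks, "u1")      # tags aggregated over the bookmarks of u1
```

Because each user contributes a tag at most once per resource, no weight in `t_r1` can exceed the number of users who bookmarked `r1`, which matches the bound on the weights stated in the text.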

Tags are highly likely to coincide among the users who annotate the same
resource. This feature makes the aggregation of users' annotations in
collaborative tagging especially interesting compared with simple tagging.

Figure E.1: Comparison between annotations in simple and collaborative tagging
systems.

8 http://www.flickr.com
9 http://www.youtube.com
10 http://digg.com

In a collaborative tagging system, for example, one user might annotate this
very work with the tags etiketa-sozialak (social tags), ikerketa (research) and
tesia (thesis), while another user might use the tags etiketa-sozialak,
markatzaile-sozialak (social bookmarks), doktoretza (doctorate) and tesia. Each
user's behavior can be very different in these systems, and that is why the
annotations of all of them are taken into account when aggregating. Aggregating
the annotations of the two users in this example would give the following:
tesia (2), etiketa-sozialak (2), markatzaile-sozialak (1), doktoretza (1) and
ikerketa (1).
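The two-user example can be reproduced in a few lines. This is only an illustrative sketch: the variable names are hypothetical, and the tag strings are the ones given in the example.

```python
from collections import Counter

# Tag sets of the two users in the example (tag strings as given in the text).
user_a = {"etiketa-sozialak", "ikerketa", "tesia"}
user_b = {"etiketa-sozialak", "markatzaile-sozialak", "doktoretza", "tesia"}

# Aggregate the annotations: each tag is weighted by the number of users
# who assigned it to the resource.
aggregated = Counter()
for tags in (user_a, user_b):
    aggregated.update(tags)
```

The two tags shared by both users end up with weight 2, and the remaining three with weight 1.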

In this thesis, we analyze and study the annotations users make on social
tagging systems. We present research on how to take advantage of those tags to
achieve an appropriate classification of resources. Specifically, we study how
useful user-generated folksonomies are for obtaining something similar to a
classification made by experts. We examine different representations of social
tags, always with the aim of approximating those expert-made classifications. In
particular, our goal is to make the most of social tags, both by looking for the
most suitable representation and by considering the effect of the distributions
that tags present across resources, bookmarks and users. Finally, we present a
study aimed at finding the users who are closest to a classification task,
relying on state-of-the-art techniques for detecting user behavior.

E.2 Objectives

The main objective of this thesis is to contribute new knowledge on how best to
use the annotations found on social tagging systems so as to make the most of
them. Given the interest of classifying resources, and bearing in mind the lack
of meaningful data to represent them, our goal is to study what social tags can
contribute to resource classification and to find out the most suitable way of
using them. At the beginning of this work, we found that no prior work had
investigated this. That motivated us to carry out this research. With this goal
in mind, we formulated the following statement, which summarizes the main aim of
the work:

Problem Statement

How can the annotations users make on social tagging systems be harnessed to
achieve the most accurate resource classification possible?

E.3 Methodology

The research methodology followed to carry out this thesis comprised the
following 8 steps:

1. A thorough review and reading of the state of the art, together with studying
social tagging systems and gaining a good understanding of how they work.

2. Finding a suitable SVM classifier for carrying out the work.

3. Searching for collections built from social tagging systems. As we found none
that met our needs, we decided to create our own, building three large-scale
collections.

4. Devising and proposing approaches and representations, based on social tags
and intended for use in the classification task, that are suitable for
carrying out the work.

5. Evaluating the proposed approaches and representations.

6. Performing an in-depth analysis of the results, with the aim of reaching
sound conclusions.

7. Presenting partial results of the work at national and international
conferences and workshops, in order to gather the opinions and
recommendations of other researchers.

8. Gathering and summarizing in this thesis the research carried out, the main
contributions and the conclusions reached.

The process from step 4 to step 6 was iterated, with those steps being carried
out more than once.

E.4 Structure of the Thesis

This thesis has 8 chapters. The content of each of them is briefly described
below.

Chapter 1, on page 21

Introduction

We present the motivation for studying social tags for resource classification.
We describe the task formally and motivate the need for this research.

Chapter 2, on page 33

Related Work

We summarize earlier work in this area and in related ones, covering both the
use of social tags and resource classification.

Chapter 3, on page 47

Support Vector Machines for Large-Scale Classification

We present an analysis of different SVM approaches for classifying large
collections of resources into multiclass taxonomies. This study allows us to
determine which SVM approach is most suitable for use throughout the thesis.

Chapter 4, on page 59

Generation of Social Tagging Data Sets

We describe and analyze in detail the social tagging collections we created for
use throughout this work. We explain the process followed to create the
collections and analyze the main characteristics of the folksonomies.

Chapter 5, on page 75

Representing the Aggregation of Tags

We propose and evaluate different representations of social tags for resource
classification. We study the performance of social tags for resource
classification in comparison with other data sources, and analyze which
representation is most suitable for the task. In addition, we combine social
tags with other data sources in order to improve performance.

Chapter 6, on page 95

Analysis of Tag Distributions for Resource Classification

We analyze how representative each tag in a social tagging system is for
resource classification. To that end, we use weighting functions adapted to
these systems. We also analyze to what extent these functions are suitable,
always considering the settings of each system.

Chapter 7, on page 111

Analysis of User Behavior for Classification

We analyze the effect that the behavior of users of social tagging systems can
have on resource classification. Building on earlier work stating that there are
users whose aim is to categorize, we study whether those users are better suited
to resource classification.

Chapter 8, on page 125

Conclusions and Future Directions

We summarize the main conclusions and contributions of the work. In addition, we
answer the questions formulated at the beginning of the work and present
directions for future work.

In addition, this thesis includes the following appendices, with supplementary
information and summaries in other languages:

Appendix A, on page 143

Additional Results

As a supplement, we present some supporting results which, although not included
as part of the thesis itself, help to demonstrate and understand some of the
conclusions.

Appendix B, on page 145

Key Terms and Definitions

We give the definitions of several terms related to social tagging systems.

Appendix C, on page 147

List of Acronyms

A list of the acronyms used throughout the work and their meanings.

Appendix D, on page 149

Resumen (Summary in Spanish)

A summary in Spanish of the content of this work.

Appendix E, on page 167

Laburpena (Summary in Basque)

A summary in Basque of the content of this work.

E.5 Research Questions Answered

Research Question 1

What kind of SVM classifier should be used to carry out these classification
tasks: a native multiclass classifier, or a combination of binary classifiers?

We have shown that using a native multiclass SVM classifier is far more suitable
than combining binary classifiers. Our results clearly showed that relying on
binary classifiers is not a suitable option for multiclass taxonomies. Native
multiclass classifiers, which gain a better understanding of the whole task by
considering all the categories at once, are therefore better suited to achieving
higher performance.

Research Question 2

What type of learning performs better for these classification tasks: supervised
or semi-supervised?

Semi-supervised techniques can obtain better results when the set of previously
classified resources is very small, but supervised techniques achieve similar
performance once a few more resources are taken into account. In addition,
supervised techniques are less demanding computationally. We have thus shown the
opposite of what Joachims (1999) showed for binary classification, namely that
supervised and semi-supervised techniques perform very similarly on multiclass
tasks. It seems reasonable that, as the number of categories grows, it becomes
harder for semi-supervised techniques to learn properly, since misclassified
resources add noise to the learning process.

In light of these conclusions, we decided to use a supervised multiclass SVM
throughout the thesis.

Research Question 3

How do the settings of social tagging systems affect the annotations of their
users and the resulting folksonomies?

To find out, we analyzed different features of the settings of social tagging
systems. Among the features analyzed, we noted the importance of tag
suggestions, since they have a marked effect on the structure of folksonomies.
The social tagging systems we studied have different settings as far as
suggestions are concerned:

• Resource-based suggestions (Delicious): when, upon tagging a resource, the
system suggests tags that other users have assigned to that resource, the
probability that the user defines new tags drops sharply, since users mostly
rely on the suggestions. In this case, users put little effort into thinking
of new tags and prefer to follow the suggestions.

• Personomy-based suggestions (GoodReads): when, upon tagging a resource, the
system suggests tags that the same user has previously assigned to other
resources, the number of tags used by the user drops sharply. The user's
vocabulary is therefore much smaller. Even so, users do not know which tags
others have assigned to the resource, and the probability of defining tags
that are new to the resource is maintained.

• No suggestions (LibraryThing): when the system suggests no tags upon tagging
a resource, the user's vocabulary grows, and the probability of defining new
tags on each resource is maintained.

Research Question 4

What is the best way of aggregating all the annotations users have made on a
resource into a single representation?

We have shown that it is worth considering all the tags, rather than relying
only on the few tags annotated by most users. The most frequently annotated tags
are the most important ones, and they best indicate what the resource is about.
Even so, tags annotated by few users also carry some representativity, albeit to
a different degree, and provide information that is useful to the classifier.

As for the weights given to tags when representing resources, using the number
of users who assigned each tag as its weight gives the best results. It
outperformed other approaches such as ignoring the number of users, or taking
into account the total number of users who annotated the resource.

In short, the best representation we have found is to use all the tags and to
weight each tag by the number of users who assigned it.
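The alternatives compared in this answer can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name `represent`, its parameters and the scheme labels are hypothetical, and the three schemes are one reading of the approaches described in the text (weight by number of users, ignore the number of users, or normalize by the total number of users who annotated the resource).

```python
def represent(tag_counts, n_users, scheme="count", top_k=None):
    """Build a sparse tag vector for one resource.

    tag_counts: dict mapping tag -> number of users who assigned it.
    n_users:    total number of users who annotated the resource.
    scheme:     "count" (number of users per tag), "binary" (ignore counts)
                or "fraction" (count divided by n_users).
    top_k:      keep only the k most frequently assigned tags; None keeps all.
    """
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if top_k is not None:
        ranked = ranked[:top_k]
    if scheme == "count":
        return {tag: float(count) for tag, count in ranked}
    if scheme == "binary":
        return {tag: 1.0 for tag, _count in ranked}
    if scheme == "fraction":
        return {tag: count / n_users for tag, count in ranked}
    raise ValueError(f"unknown scheme: {scheme}")


counts = {"tesia": 2, "etiketa-sozialak": 2, "ikerketa": 1}
all_tags_by_users = represent(counts, n_users=2)            # all tags, user counts
top_tags_only = represent(counts, n_users=2, top_k=2)       # drops the rarer tag
as_fraction = represent(counts, n_users=2, scheme="fraction")
```

Under the conclusion stated above, `all_tags_by_users` corresponds to the representation that performed best, while `top_tags_only` discards the less frequent tags that still carry useful information.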

Research Question 5

Beyond the value of social tags for these tasks, is it worth considering other
data sources, such as the content of the resource itself, to improve the results
even further?

Relying on classifier committees, which combine the predictions of different
classifiers, we have shown that tags provide judgments worth taking into
account. These judgments are very useful for combining social tags with other
data sources. However, not every data source helps when combining. The chosen
data sources must be solid enough to provide suitable predictions. When the data
sources are well chosen, though, performance can improve considerably.
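A classifier committee of this kind can be sketched as below. The text does not specify here how the predictions are combined, so the sketch assumes one common option: each classifier outputs a score per category, the scores are added up, and the category with the highest total is chosen. The function name, the category labels and the score values are all hypothetical.

```python
def committee_predict(score_sets):
    """Combine per-category scores from several classifiers by summing them.

    score_sets: iterable of dicts, one per classifier, mapping
                category -> score (for example, an SVM margin).
    Returns the winning category and the summed scores.
    """
    totals = {}
    for scores in score_sets:
        for category, score in scores.items():
            totals[category] = totals.get(category, 0.0) + score
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical scores for one resource from two classifiers, one trained on
# social tags and one trained on the content of the resource.
tag_scores = {"Arts": 0.9, "Science": 0.2, "Sports": -0.5}
content_scores = {"Arts": -0.1, "Science": 0.4, "Sports": -0.3}

best, totals = committee_predict([tag_scores, content_scores])
```

In this made-up case the content classifier alone would pick "Science", but the confident judgment of the tag classifier tips the committee towards "Arts". A weak data source can equally pull the sum the wrong way, which is why the choice of sources matters.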

6. Ikerketa Galdera 

Baliabideak maila baxuagoko kategoria zehatzagoetan sailkatu ahal izateko nahikoa 
erabilgarriak eta zehatzak dira etiketa sozialak? 

We have studied the usefulness of social tags for resource classification at 
two levels of the taxonomies. Besides top-level categories, we also 
experimented with narrower second-level categories. When the site encourages 
users to annotate (as on Delicious and LibraryThing), social tags achieve 
better results than the other data sources. Tags outperform the other data 
sources by a wide margin in these cases, especially on Delicious, where the 
second-level results show a much clearer advantage. The same difference holds 
on LibraryThing. Finally, tags on GoodReads do not outperform the other data 
sources, not even for top-level categories, because the system does not 
encourage users to tag and, as a result, fewer annotations are made. 

These results refute the hypothesis of Noll and Meinel (2008a). Based on their 
statistical analysis, they believed that tags would not be useful for 
classifying into narrower categories, and that other data sources would have to 
be used for that purpose. 

Research Question 7 

Can the distribution of tags across the collection be taken into account when 
measuring the representativeness of tags? 

We have shown that taking tag distributions into account, by means of an 
IDF-based weighting function, is worthwhile for determining the 
representativeness of tags for resource classification. The settings of the 
social tagging system, however, have a strong influence on these distributions. 
Among the settings of a site, we have found that resource-based tag suggestions 
matter a great deal, because they completely change the structure of the 
folksonomy. Systems with such suggestions exhibit very different distributions. 
This makes it possible to anticipate how useful the weighting functions will 
be. 
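
To make the idea of an IDF-based weighting function concrete, the sketch below computes three inverse-frequency weights per tag, named after the IRF, IUF and IBF functions discussed in this summary. The formulas are assumptions that follow the usual IDF form log(N / n_t), counting resources, users and bookmarks respectively; the (user, resource, tags) input format and the function name `inverse_frequency_weights` are likewise hypothetical.

```python
import math

def inverse_frequency_weights(bookmarks):
    """Compute IDF-style weights per tag over a collection of bookmarks.

    Each weight has the form log(N / n_t). Assumed definitions:
      IRF: N = resources, n_t = resources annotated with tag t
      IUF: N = users,     n_t = users who used tag t
      IBF: N = bookmarks, n_t = bookmarks containing tag t
    """
    bookmarks = list(bookmarks)
    resources, users = set(), set()
    tag_resources, tag_users, tag_bookmarks = {}, {}, {}
    for user, resource, tags in bookmarks:
        resources.add(resource)
        users.add(user)
        for tag in set(tags):  # count a tag once per bookmark
            tag_resources.setdefault(tag, set()).add(resource)
            tag_users.setdefault(tag, set()).add(user)
            tag_bookmarks[tag] = tag_bookmarks.get(tag, 0) + 1
    return {
        tag: {
            "IRF": math.log(len(resources) / len(tag_resources[tag])),
            "IUF": math.log(len(users) / len(tag_users[tag])),
            "IBF": math.log(len(bookmarks) / tag_bookmarks[tag]),
        }
        for tag in tag_bookmarks
    }

# Toy collection: 3 resources, 3 users, 4 bookmarks (invented data).
bookmarks = [
    ("u1", "r1", ["python", "web"]),
    ("u2", "r1", ["python"]),
    ("u1", "r2", ["web"]),
    ("u3", "r3", ["cooking"]),
]
weights = inverse_frequency_weights(bookmarks)
print(weights["cooking"])
```

A tag that is rare across the collection, such as "cooking" in the toy data, receives a higher weight than one that appears in many bookmarks, mirroring the role of IDF in text collections.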

In our classification experiments, we observed that the weighting functions 
outperform TF when there are no resource-based suggestions, that is, on 
LibraryThing and GoodReads, both when used alone and when combined with other 
data sources. Moreover, it is preferable to use them alone, without combining 
them with other data sources, because this yields better results. 

When resource-based suggestions are given, however, the structure of the 
folksonomy is very different; this affects the distributions and, consequently, 
the weighting functions as well. For this reason, using the weighting functions 
yields worse results, and they have to be combined with other data sources in 
order to improve. When combined, though, they achieve better results than TF, 
thanks to the suitable judgments of the classifier. 

Research Question 8 

What is the best approach for measuring the representativeness of tags within 
the collection? 

Among the weighting functions studied, the one based on bookmark frequencies 
achieves the best results when the system does not provide resource-based 
suggestions. In these cases, IBF is the best, followed by IRF and IUF. All of 
them clearly outperform TF, both when used alone and when combined with other 
data sources through classifier committees. 

On the other hand, when the site provides resource-based suggestions, user 
frequencies give better results. IUF performs better than IBF and IRF, owing to 
the importance of users who define their own tags instead of relying on the 
suggestions offered by sites of this kind. Although IUF alone does not 
outperform TF in this case, it achieves the best results when combined with 
other data sources. Even so, it only slightly outperforms the TF-based 
classifier committees, and either of the two could be used with similar 
results. 

Research Question 9 

Is it possible to distinguish user profiles, with the aim of finding users 
whose annotations come as close as possible to a classification task? 

We have shown that this type of user, called Categorizer, exists. According to 
our experiments, this holds especially on sites without tag suggestions, that 
is, on LibraryThing. On this site, the tags defined by Categorizers achieve 
better classification results. When suggestions are given, however, these users 
are harder to detect, as happens with GoodReads and Delicious. Even so, using a 
suitable measure makes it possible to detect Categorizers in these cases as 
well. 

Research Question 10 

Which features determine that a user is producing a good classification of 
resources? 

Of the two features analyzed, we have shown that user verbosity is the most 
useful one for detecting the Categorizers best suited to the classification 
task. Verbosity reflects a user's habit of assigning fewer or more tags. Using 
this feature, it is possible to find the users who are closest to the 
classification task. On the other hand, the diversity of a user's vocabulary 
makes it possible to detect users who are Describers. In addition, we have 
found that users who do not use descriptive tags produce a better 
classification. 
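
The verbosity feature can be illustrated with a short sketch. Here verbosity is taken to be the mean number of tags a user assigns per bookmark; this definition, the (user, resource, tags) input format and the function name `user_verbosity` are assumptions made for the example, since this summary does not give the exact formula or a threshold for separating Categorizers from Describers.

```python
from collections import defaultdict

def user_verbosity(bookmarks):
    """Average number of tags per bookmark for each user (assumed measure)."""
    tag_total = defaultdict(int)
    bookmark_total = defaultdict(int)
    for user, _resource, tags in bookmarks:
        tag_total[user] += len(tags)
        bookmark_total[user] += 1
    return {u: tag_total[u] / bookmark_total[u] for u in bookmark_total}

# Toy data, invented for illustration only.
bookmarks = [
    ("u1", "r1", ["fiction"]),
    ("u1", "r2", ["history"]),
    ("u2", "r1", ["novel", "funny", "to-read", "paperback"]),
    ("u2", "r3", ["war", "europe"]),
]
verbosity = user_verbosity(bookmarks)
# Users ordered from least to most verbose.
ranked = sorted(verbosity, key=verbosity.get)
print(verbosity, ranked)
```

Ranking users by such a measure gives an ordering along which a cut-off could be chosen to select the users whose tags are fed to the classifier.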

E.6 Main Contributions 

The novel idea of this work lies in exploiting social tags for resource 
classification. To the best of our knowledge, our earlier work (Zubiaga et al., 
2009d) is the first to carry out actual classification experiments using social 
tags. Before that, Noll and Meinel (2008a) compared social tags with expert 
classifications by means of a statistical analysis. Given the lack of work in 
this area, the work presented in this thesis sheds light on the appropriate use 
and representation of social tags for resource classification. Specifically, we 
have made the following main contributions in this work: 

• We have created 3 large-scale collections based on social tagging systems, 
together with the classification data assigned by experts to the resources 
considered. These 3 collections arguably rank among the largest datasets used 
in research and, to the best of our knowledge, are the largest used for 
resource classification. We have made some of these datasets, along with some 
smaller ones, publicly available so that other researchers can use them 11 . 
These datasets have been used, among others, by Godoy and Amandi (2010) and 
Strohmaier et al. (2010b) in their research. 

• Ours is the first work to compare different representations of social tags. 
It is also the first research work to carry out actual classification 
experiments comparing social tags with other data sources. We have shown that 
social tags are useful not only for top-level categories, but also for 
classifying into narrower, lower-level categories. We thereby refute the 
hypothesis put forward by Noll and Meinel (2008a). According to the statistical 
analysis in that work, the authors state that the usefulness of tags for 
narrower categories could be very limited. 



11 http://nlp.uned.es/social-tagging/datasets/ 

• We have analyzed the distributions of social tags in folksonomies, and 
studied the effect that each system's settings have in this respect. To do so, 
we relied on a well-known weighting function, TF-IDF, adapting it to the 
structure of folksonomies. 

• We have shown that there is a group of users whom we have defined as 
Categorizers. The annotations of these users are closer to the experts' 
taxonomic classifications than those of other users, whom we have called 
Describers. Although approaches for distinguishing Categorizers from Describers 
had been tested before, this is the first work to show that Categorizers are 
better suited to resource classification. 

Using social tags for resource classification was a new line of research when 
we began this thesis. Recently, however, the growing interest in researching 
user-generated content on social networks, and on social tagging systems in 
particular, has brought a large amount of new work. Along with this growth, 
more researchers have shown interest in using social tags for classification, 
and the number of research works in this area has risen markedly. Godoy and 
Amandi (2010), for instance, use tags for classification, building on one of 
our earlier works (Zubiaga et al., 2009d). 

E.7 Future Directions 

Using social tags for resource classification is still a young research area, 
and little work has been done on it. The work presented in this thesis 
clarifies how social tags should be represented in order to obtain the most 
accurate resource classification possible. In addition, it has opened up new 
directions for future work. 

Throughout this thesis, we have treated each tag as a distinct symbol, without 
analyzing the semantic meaning it may carry. In this respect, an interesting 
line of future work would be to perform a semantic analysis to detect synonyms 
and other relations among tags. By applying natural language processing 
techniques, or by resorting to semantic approaches, knowledge about tags could 
be increased, enabling a deeper analysis of folksonomies. 

The weighting functions we used in Chapter 6 are based on the TF-IDF function, 
which was designed for text collections. In the future, it could be interesting 
to try other functions, and also to propose new functions that can be adapted 
to the structure of folksonomies. This would help considerably for systems that 
provide suggestions, such as Delicious, since the functions we tried did not 
work as well as expected on this system. 



